Abstract - In this paper, we focus on the downlink of a cellular 
system, which corresponds to the bulk of the data transfer in 
such wireless systems. We address the problem of opportunistic 
multiuser scheduling under imperfect channel state information, 
by exploiting the memory inherent in the channel. In our setting, 
the channel between the base station and each user is modeled 
by a two-state Markov chain and the scheduled user sends back 
an ARQ feedback signal that arrives at the scheduler with a 
random delay that is i.i.d. across users and time. The scheduler 
indirectly estimates the channel via accumulated delayed-ARQ 
feedback and uses this information to make scheduling decisions. 
We formulate a throughput maximization problem as a partially 
observable Markov decision process (POMDP). For the case of 
two users in the system, we show that a greedy policy is sum 
throughput optimal for any distribution on the ARQ feedback 
delay. For the case of more than two users, we prove that the 
greedy policy is suboptimal and demonstrate, via numerical 
studies, that it has near optimal performance. We show that 
the greedy policy can be implemented by a simple algorithm 
that does not require the statistics of the underlying Markov 
channel or the ARQ feedback delay, thus making it robust 
against errors in system parameter estimation. Establishing 
an equivalence between the two-user system and a genie-aided 
system, we obtain a simple closed form expression for the 
sum capacity of the Markov-modeled downlink. We further 
derive inner and outer bounds on the capacity region of the 
Markov-modeled downlink and tighten these bounds for special 
cases of the system parameters. 

Index Terms - Opportunistic multiuser scheduling, cellular downlink, 
Markov channel, ARQ feedback, delay, greedy policy, sum 
capacity, capacity region. 



I. Introduction 

With the ever increasing demand for high data rates, op- 
portunistic multiuser scheduling, introduced by Knopp and 
Humblet in [1], and defined as allocating the resources to the 
user experiencing the most favorable channel conditions, has 
gained immense popularity among wireless network designers. 
Opportunistic multiuser scheduling essentially exploits the 
multiuser diversity in the system and has motivated several 
researchers (e.g., [2]-[6]) to study the performance gains 
obtained by opportunistic scheduling under various scenarios. 
While the i.i.d. flat fading model is used in these works 
to model time varying channels (for a general treatment on 
opportunistic scheduling with minimal assumptions on the 
channel, see [7]), it fails to capture the memory in the channel 
observed in realistic scenarios. Hence, more recently, oppor- 
tunistic scheduling has also been investigated by modeling the 
channels by Markov chains (e.g., [8]-[13]). However, in these 
works, the channel state information that is crucial for the 
success of any opportunistic scheduling scheme is assumed 
to be readily available at the scheduler. This is a simplifying 
assumption that does not hold in reality, where a non-trivial 
amount of resource must be spent in gathering the information 
on the channel state. Another line of work (e.g., [14], [15]) 
attempts to exploit the memory in the Markov-modeled 
channels to gather this information. Specifically, Automatic 
Repeat reQuest (ARQ) feedback, which is traditionally used for 
error control (e.g., [16]-[19]) at the data link layer, is used to 
estimate the state of the Markov-modeled channels. 

These two lines of work can be combined to create a 
new design paradigm: exploit multiuser diversity in 
Markov-modeled channels (e.g., [8]-[13]) and use the already 
existing ARQ feedback mechanism to estimate the state of 
these Markov-modeled channels (e.g., [14], [15]). Assuming 
instantaneous ARQ feedback (i.e., it arrives at the end of the 
slot) and an ON-OFF Markov channel model (the Gilbert-Elliott 
model [20]), this problem was addressed in the independent works 
[21], [22]. In [21], the authors studied opportunistic spectrum 
access in a cognitive radio setting (a setup mathematically 
equivalent to the instantaneous ARQ based opportunistic 
scheduling in a Markov-modeled downlink) and showed 
that a simple greedy scheduling policy is optimal. In [22], 
we directly addressed the instantaneous ARQ based downlink 
scheduling problem. By identifying a special mathematical 
structure in the problem, we derived a closed form expression 
for the two-user sum capacity of the downlink and obtained 
bounds on the system stability region. 

In this paper, we model the downlink channels by two-state 
(ON-OFF) Markov chains and study the ARQ based joint 
channel learning-scheduling problem when the ARQ feedback 
arrives at the scheduler with a random delay that is i.i.d. 
across users and time. The delay in the feedback channel is an 
important consideration that cannot be overlooked in realistic 
scenarios. The effect of feedback delay on channel resource 
allocation has been studied under various settings in the past 
(e.g., [23]-[26]). While these works assume deterministic 
delay, we consider random, i.i.d. feedback delay. An instance 
when the feedback delay can be i.i.d. is when the delay is due 
to the channel propagation time of the feedback signal and when 
the feedback channel environment changes drastically due to 
high mobility of users. In essence, by modeling the feedback 
delay to be random, we attempt to capture the effect of the 
non-idealities of the feedback channel on the joint channel 
learning-scheduling problem, in a more general framework. 

It turns out that, despite the random delay, the ARQ 
feedback can be used for opportunistic scheduling to achieve 
performance gains. A sample of this gain is illustrated in Fig. 1 
for a specific set of system parameters to be defined in the next 
section. Fig. 1 plots the sum (over all the downlink users) rate 
of successful transmission of packets over a length of m slots 
under optimal opportunistic scheduling when the scheduler 
has: (a) randomly delayed channel state information (CSI) 
from all the downlink users, (b) randomly delayed CSI from 
the scheduled user, i.e., randomly delayed ARQ feedback, 
and (c) no CSI, i.e., random scheduling. We make two 
observations from the figure: (1) Using delayed ARQ feedback 
for opportunistic scheduling can achieve performance close to 
opportunistic scheduling using delayed CSI from all users, and 
(2) a 49% gain (when m = 7) in the sum rate is associated 
with opportunistic scheduling using delayed ARQ over random 
scheduling. These observations motivate our approach: exploit 
multiuser diversity in Markov-modeled downlink channels 
using the already existing (albeit delayed) ARQ feedback 
mechanisms. 

Fig. 1. Illustration of the gains associated with opportunistic scheduling 
using randomly delayed ARQ feedback. System parameters used: $p = 0.8700$, 
$r = 0.1083$, $P_D(d = 0) = …$, $P_D(d = 1) = …$, $P_D(d = 2) = …$, 
$P_D(d > 2) = 0$, $\pi_m = [0.3358\ 0.1851\ 0.5483]$. 



When compared to the instantaneous ARQ case, the ran- 
domly delayed ARQ case adds additional layers of complexity 
to the scheduling problem, making it different and far more 
challenging than the former. However, we show that, when 
there are two users in the system, for any ARQ delay distri- 
bution, the greedy policy that was optimal in the instantaneous 
ARQ case [21] is also optimal in the delayed ARQ case. 
For more than two users, however, using a counterexample, 
we show that the greedy policy is not, in general, optimal. 
Despite the suboptimality, extensive numerical experiments 
suggest that the greedy policy has near optimal performance. 
Encouraged by this insight, we study the structure of the 
greedy policy and show that it can be implemented via a 
simple algorithm that is immune to errors in the estimates of 
the Markov channel parameters and the ARQ delay statistics. 
We also study the fundamental limits of the Markov-modeled 
downlink with randomly delayed ARQ feedback. By estab- 
lishing an equivalence between the two-user downlink and a 
genie-aided system, we derive a simple closed form expression 
for the sum capacity of the two-user downlink, while obtaining 
bounds on the sum capacity for larger number of users. We 
further derive inner and outer bounds on the capacity region 
of the downlink and tighten these bounds for special cases of 
the system parameters. 

The rest of the paper is organized as follows. The problem 
setup is described in Section II, followed by a study of the 
optimality properties of the greedy policy in Section III-A. 
Section III-B contains a numerical performance analysis of the 
greedy policy. In Section III-C, we discuss the implementation 
structure of the greedy policy. We then study the sum capacity 
and the capacity region of the Markov-modeled downlink in 
Section IV, followed by concluding remarks in Section V. 

II. Problem Setup 

A. Channel Model 

We consider downlink transmissions with N users. For each 
user, there is an associated queue at the base station that 
accumulates packets intended for that user. We assume that 
each queue is infinitely backlogged. The channel between 
the base station and each user is modeled by a two-state 
Markov chain that is i.i.d. across users. Each state 
corresponds to the degree of decodability of the data sent 
through the channel. State 1 (ON) corresponds to full 
decodability, while state 0 (OFF) 
corresponds to zero decodability. Time is slotted and the 
channel of each user remains fixed for a slot and moves into 
another state in the next slot following the state transition 
probability of the Markov chain. The time slots of all users are 
synchronized. The two-state Markov channel is characterized 
by a $2 \times 2$ probability transition matrix 

$$P = \begin{bmatrix} p & 1-p \\ r & 1-r \end{bmatrix}, \tag{1}$$

where 

$p$ := Prob{channel is in ON state in the current slot | 
channel was in ON state in the previous slot}, 
$r$ := Prob{channel is in ON state in the current slot | 
channel was in OFF state in the previous slot}. 

The states can be interpreted as a quantized representation 
of the underlying channel strength, which lies on a continuum. 
It is known from classic works [27], [28] that the 
fading channel, with reasonable accuracy, can be modeled 
by finite state Markov chains and that, in reality, the fading 
process is observed to be gradual enough that the state 
transitions/crossovers can be restricted to adjacent states of 
the Markov model. With the top half of the states in these 
models cumulatively represented by the ON state and the 
rest by the OFF state in our two-state model, we see that, 
in realistic scenarios, the crossover from ON to OFF state 
(respectively, OFF to ON) is less likely to occur than staying in 
ON state (respectively, OFF state). This is positive correlation, 
i.e., p > r. Motivated by this, we restrict our attention to p > r 
throughout this work. 
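
The two-state channel model above can be made concrete with a short simulation. The following Python sketch is illustrative only: the function names are hypothetical, the closed form $r/(1 - p + r)$ for the stationary ON probability is a standard property of this chain rather than something stated in this section, and the values of $p$ and $r$ are the ones quoted in the caption of Fig. 1.

```python
import random

def stationary_on_probability(p, r):
    """Stationary probability of the ON state: the fixed point of x = x*p + (1 - x)*r."""
    return r / (1.0 - p + r)

def simulate_channel(p, r, n_slots, seed=0):
    """Sample a 0/1 (OFF/ON) trajectory of the two-state Markov channel.

    Assumption of this sketch: the initial state is drawn from the
    stationary distribution.
    """
    rng = random.Random(seed)
    state = int(rng.random() < stationary_on_probability(p, r))
    path = []
    for _ in range(n_slots):
        path.append(state)
        # ON -> ON with probability p, OFF -> ON with probability r.
        on_prob = p if state == 1 else r
        state = int(rng.random() < on_prob)
    return path

p, r = 0.8700, 0.1083          # positively correlated channel, p > r
pi_s = stationary_on_probability(p, r)
path = simulate_channel(p, r, n_slots=10_000)
```

With $p > r$, consecutive samples in `path` tend to repeat, which is the memory that the scheduler exploits in the sections that follow.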

B. Scheduling Problem 

The base station (henceforth known as the scheduler) is the 
central controller that controls the transmission to the users 
in each slot. In any time slot, the scheduler does not know 
the exact channel state of the users and it must schedule the 
transmission of the head-of-line packet of exactly one user. 
Thus, a TDMA styled scheduling is performed here. The 
power spent in each transmission is fixed. At the beginning 
of a time slot, the head-of-line packet of the scheduled user 
is transmitted. The scheduled user attempts to decode the 
received packet and, based on the decodability of the packet, 
sends back an ACK (bit 1) or NACK (bit 0) feedback signal to the 
scheduler at the end of the time slot, over an error-free 
feedback channel. The feedback channel is assumed to suffer 
from a random delay that is i.i.d. across users and time. This 
delayed feedback information, along with the label of the time 
slot from which it is acquired, will be used by the scheduler 
in scheduling decisions. The scheduler aims to maximize the 
sum of the rate of successful transmission of packets to all the 
users in the system. We formally define the problem below. 

C. Formal Problem Definition 

Since the scheduler must make scheduling decisions based 
only on a partial observation¹ of the underlying Markov chain, 
the scheduling problem can be represented by a partially 
observable Markov decision process (POMDP). See [29] for 
an overview of POMDPs. We now formulate our problem in 
the language of POMDPs. The key quantities used throughout 
this paper are summarized in Appendix E. 

Horizon: The number of consecutive slots over which 
scheduling is performed is the horizon. We index the time 
slots in decreasing order with slot 1 corresponding to the end 
of the horizon. Throughout this paper, the horizon is denoted 
by m, i.e., the scheduling process begins at slot m. 

Feedback arriving at slot t: For some slot t, $t \le m$, let 
$n(t)$ be the number of ARQ feedback bits (each in $\{0, 1\}$) arriving 
at the end of slot t from the users scheduled in the previous 
slots. Due to the random nature of the feedback delay, $n(t)$ 
can take values in the set $\{0, \ldots, m - t + 1\}$. Let $F_t$ represent 
all the ARQ feedback arriving at the end of slot t. Thus 
$F_t \in \{0, 1\}^{n(t)}$ if $n(t) > 0$, and $F_t = \emptyset$ if $n(t) = 0$. The ARQ 
feedback is time-stamped and thus, since the scheduler has a 
record of which users were scheduled in the past slots, it can 
map the feedback bits $F_t$ to the users and slots they originated 
from. Let $f_k$ be the feedback that originated during slot k, 
where $k \le m$. Note that since in each slot one and only one 
user is scheduled, $f_k$ is neither empty nor has multiple values, 
i.e., $f_k \in \{0, 1\}$ with bit 0 mapped to NACK and bit 1 to 
ACK feedback. 

¹ In this case, the set of time-stamped binary delayed feedback on the 
channels. 



Delay of feedback from user i in slot t: Let $D(i,t)$ be 
the random variable corresponding to the delay, in number of 
slots, experienced by the feedback sent by user i in slot t. Let 
$D(i,t) = 0$ correspond to the case when the ARQ feedback 
originating from user i in slot t arrives at the scheduler at the 
end of the same slot t. We assume the distribution of $D(i,t)$ 
to be i.i.d. across users i and time t throughout this work, and 
let $P_D(d)$, $d \in \{0, 1, \ldots\}$, denote the probability mass function 
of D. 

Belief value of user i in slot t, $\pi_t(i)$: This represents the 
probability that the channel of user $i \in \{1, \ldots, N\}$, in slot t, is 
in the ON state, given all the past feedback about the channel. 
Define $T^u(\cdot)$, for $u \in \{0, 1, \ldots\}$, as the u-step belief evolution 
operator given by $T^u(x) = T(T^{(u-1)}(x)) = T^{(u-1)}(T(x))$ 
with $T(x) = xp + (1 - x)r$ and $T^0(x) = x$ for $x \in [0, 1]$. 
Now if, at the end of slot t + 1, the arriving feedback 
$F_{t+1}$ contains the ARQ feedback from user i from slot 
$k \in \{m, m-1, \ldots, t+1\}$, i.e., $f_k$, and if k is the latest 
slot from which an ARQ feedback from user i has arrived, 
then $\pi_t(i)$ is obtained by applying the 1-step belief evolution 
operator repeatedly over all the time slots between 'now' (slot 
t) and slot k, i.e., 

$$\pi_t(i) = \begin{cases} T^{k-t}(1) = T^{(k-t-1)}(p), & \text{if } f_k = 1 \\ T^{k-t}(0) = T^{(k-t-1)}(r), & \text{if } f_k = 0, \end{cases} \tag{2}$$

where we have used $T^u(x) = T^{u-1}(T(x))$. If k is not the 
latest slot from which an ARQ feedback from user i has 
arrived (possible since the random nature of the feedback delay 
can result in out-of-turn arrival of ARQ feedback), then due 
to the first-order Markovian nature of the channels, this ARQ 
feedback does not have any new information to affect the belief 
value, and so $\pi_t(i) = T(\pi_{t+1}(i))$. Similarly, if $F_{t+1}$ does not 
contain any feedback from user i, then $\pi_t(i) = T(\pi_{t+1}(i))$. 
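
The belief update can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, slots follow the paper's decreasing index convention (slot k > t lies in the past of slot t), and the numerical values of $p$ and $r$ are those quoted in the caption of Fig. 1.

```python
def T(x, p, r):
    """One-step belief evolution: T(x) = x*p + (1 - x)*r."""
    return x * p + (1.0 - x) * r

def T_pow(x, u, p, r):
    """u-step belief evolution T^u(x), with T^0(x) = x."""
    for _ in range(u):
        x = T(x, p, r)
    return x

def belief_from_latest_feedback(f_k, k, t, p, r):
    """Belief in slot t when the latest feedback of the user is f_k from slot k > t.

    Implements T^(k-t-1)(p) for an ACK (f_k = 1) and T^(k-t-1)(r) for a NACK.
    """
    start = p if f_k == 1 else r
    return T_pow(start, k - t - 1, p, r)

def belief_without_feedback(pi_m_i, m, t, p, r):
    """Belief in slot t when no feedback of the user has arrived: T^(m-t)(pi_m(i))."""
    return T_pow(pi_m_i, m - t, p, r)

p, r = 0.8700, 0.1083
after_ack = belief_from_latest_feedback(1, k=7, t=4, p=p, r=r)   # T^2(p)
after_nack = belief_from_latest_feedback(0, k=7, t=4, p=p, r=r)  # T^2(r)
```

Because $T$ is affine, $T^u(x) = \pi_s + (p - r)^u (x - \pi_s)$ with $\pi_s = r/(1 - p + r)$, so for $p > r$ both `after_ack` and `after_nack` drift toward the stationary value as the feedback ages.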

Reward structure: In any slot t, a reward of 1 is accrued at 
the scheduler when the channel of the scheduled user is found 
to be in the ON state, else 0 is accrued. 

Scheduling Policy $\mathfrak{A}_k$: A scheduling policy $\mathfrak{A}_k$ in slot k is 
a mapping from all the information available at the scheduler 
in slot k along with the slot index k to a scheduling decision 
$a_k$. Formally, 

$$\mathfrak{A}_k : \left([\pi_m, \pi_{m-1}, \ldots, \pi_k]^k, \{a_m, a_{m-1}, \ldots, a_{k+1}\}\right) \to a_k \quad \forall k \in [1, m],\ \pi_k \in [0,1]^N, \tag{3}$$

where $\{a_m, a_{m-1}, \ldots, a_{k+1}\}$ are the past scheduling decisions 
and $[\pi_m, \pi_{m-1}, \ldots, \pi_k]^k$ are the belief values of the 
channels of all users, corresponding to slots $\{m, m-1, \ldots, k\}$, 
held by the scheduler at the moment (slot k). 

Net expected reward in slot t, $V_t$: With the scheduling policy 
$\{\mathfrak{A}_k\}_{k=1}^{t}$ fixed, the net expected reward in slot t, i.e., $V_t$, is 
the sum of the reward expected in the current slot t and the 
net reward expected in all the future slots k < t. Formally, 
with $a_k$ denoting the scheduling decision in slot k, 

$$\begin{aligned} &V_t\left([\pi_m, \pi_{m-1}, \ldots, \pi_t]^t, \{a_m, a_{m-1}, \ldots, a_{t+1}\}, \{\mathfrak{A}_k\}_{k=1}^{t}\right) \\ &\quad = R_t(\pi_t, a_t) + E\left[V_{t-1}\left([\pi_m, \pi_{m-1}, \ldots, \pi_t, \pi_{t-1}]^{t-1}, \{a_m, a_{m-1}, \ldots, a_{t+1}, a_t\}, \{\mathfrak{A}_k\}_{k=1}^{t-1}\right)\right], \end{aligned} \tag{4}$$

where $R_t(\pi_t, a_t)$ is the expected immediate reward and the 
expectation in the future reward is over the feedback received 
in slot t, i.e., $F_t$, along with the originating slot indices. Note 
that the belief vector $[\pi_m, \pi_{m-1}, \ldots, \pi_t]^t$ is up-to-date based 
on all previous scheduling decisions and the ARQ feedback 
received before slot t. With the reward structure defined earlier, 
the expected immediate reward can be written as 

$$R_t(\pi_t, a_t) = \pi_t(a_t).$$

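Performance Metric: For a given scheduling policy 
$\{\mathfrak{A}_k\}_{k=1}^{m}$, the performance metric is given by the sum 
throughput (sum rate of successful transmission) over a finite 
horizon, m: 

$$\eta_{\text{sum}}\left(\pi_m, \{\mathfrak{A}_k\}_{k=1}^{m}\right) = \frac{V_m\left(\pi_m, \{\mathfrak{A}_k\}_{k=1}^{m}\right)}{m}, \tag{5}$$

where $\pi_m$ is the vector of initial belief values of the channels. 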

III. Greedy Policy - Optimality, Performance 
Evaluation and the Implementation Structure 

A. On the Optimality of the Greedy Policy 

Consider the following policy: 

$$\hat{\mathfrak{A}}_k : \pi_k \to \hat{a}_k = \arg\max_i R_k(\pi_k, a_k = i) = \arg\max_i \pi_k(i) \quad \forall k \ge 1,\ \pi_k \in [0, 1]^N. \tag{6}$$

Since the above given policy attempts to maximize the 
expected immediate reward, without any regard to the expected 
future reward, it follows an approach that is fundamentally 
greedy in nature. We henceforth call $\{\hat{\mathfrak{A}}_k\}_{k=1}^{m}$ the greedy 
policy and let $\hat{a}_k$ denote the scheduling decision in slot k under 
the greedy policy. We now proceed to establish the optimality 
of the greedy policy when $N = 2$. We first introduce the 
following lemma. 

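Lemma 1: For any $u, v \in \{0, 1, 2, \ldots\}$ and any $x, y \in [0, 1]$ 
with $x \ge y$, 

$$\begin{aligned} T^u(p) &\ge T^{u+1}(x) \\ T^u(r) &\le T^{u+1}(x) \\ T^u(x) &\ge T^u(y) \\ T^u(p) &\ge T^v(r). \end{aligned} \tag{7}$$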

The results of Lemma 1 can be explained intuitively. Note 
that $T^u(x)$ is the belief value of the channel (probability that 
the channel is in the ON state) in the current slot given the 
belief value, u slots earlier, was x. Also note that $T^u(p)$ 
(similarly $T^u(r)$) gives the belief value in the current slot 
given the channel was in the ON state (similarly OFF state) 
u + 1 slots earlier. Now, since the Markov channel is positively 
correlated (p > r), the probability that the channel is in the 
ON state in the current slot given it was in the ON state u + 1 
slots earlier ($T^u(p)$) is at least as high as the probability that 
the channel is ON in the current slot given it was ON with 
probability $x \in [0, 1]$, u + 1 slots earlier ($T^{(u+1)}(x)$). This 
explains the first inequality in Lemma 1. The second and third 
inequalities can be explained along similar lines. Regarding the 
last inequality, consider slots t, k such that t > k. Due to the 
Markovian nature of the channel, the closer slot k is to t, the 
stronger is the memory, i.e., the dependency of the channel 
state in k on that of t. Now, since the channel is positively 
correlated, if the channel was in the ON state in slot t, the 
closer k is to t, the higher is the probability that the channel is 
ON in slot k. By definition, this probability is given by $T^u(p)$ 
with $u = t - k - 1$. Thus $T^u(p)$ monotonically decreases with 
u. Using a similar explanation, $T^u(r)$ monotonically increases 
with u. The limiting value of both these functions, as $u \to \infty$, 
is the probability that the channel is ON when no information 
on the past channel states is available. This is given by the 
steady state probability.² This explains $T^u(p) \ge T^v(r)$ for 
any $u, v \in \{0, 1, \ldots\}$. A formal proof of Lemma 1 can be 
found in Appendix A. 
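
The inequalities discussed above can be checked numerically on a grid. The Python sketch below is only a sanity check under stated assumptions, not a substitute for the formal proof: the function names, the grid resolution and the tolerance are choices of this example, and the four inequalities tested are $T^u(p) \ge T^{u+1}(x)$, $T^u(r) \le T^{u+1}(x)$, $T^u(x) \ge T^u(y)$ for $x \ge y$, and $T^u(p) \ge T^v(r)$.

```python
import itertools

def T_pow(x, u, p, r):
    """u-step belief evolution T^u(x) with T(x) = x*p + (1 - x)*r."""
    for _ in range(u):
        x = x * p + (1.0 - x) * r
    return x

def lemma1_holds(p, r, max_u=6, grid=11, tol=1e-12):
    """Check the four inequalities on a grid of x, y in [0, 1] and u, v < max_u."""
    xs = [i / (grid - 1) for i in range(grid)]
    for u in range(max_u):
        for x in xs:
            if T_pow(p, u, p, r) < T_pow(x, u + 1, p, r) - tol:
                return False   # violates T^u(p) >= T^(u+1)(x)
            if T_pow(r, u, p, r) > T_pow(x, u + 1, p, r) + tol:
                return False   # violates T^u(r) <= T^(u+1)(x)
        for x, y in itertools.product(xs, xs):
            if x >= y and T_pow(x, u, p, r) < T_pow(y, u, p, r) - tol:
                return False   # violates monotonicity of T^u
        for v in range(max_u):
            if T_pow(p, u, p, r) < T_pow(r, v, p, r) - tol:
                return False   # violates T^u(p) >= T^v(r)
    return True

positively_correlated_ok = lemma1_holds(p=0.8700, r=0.1083)
negatively_correlated_ok = lemma1_holds(p=0.2, r=0.9)
```

The second call uses a hypothetical negatively correlated channel (p < r), for which the first inequality already fails at u = 0, x = 0; this illustrates why the lemma is stated under p > r.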

Proposition 2: For $N = 2$, the sum throughput, 
$\eta_{\text{sum}}(\pi_m, \{\mathfrak{A}_k\}_{k=1}^{m})$, of the system is maximized by the greedy 
policy $\{\hat{\mathfrak{A}}_k\}_{k=1}^{m}$ for any ARQ delay distribution. 

Proof: Consider a slot t < m. Fix a sequence of 
scheduling decisions $\mathbf{a}_{t+1} := \{a_m, a_{m-1}, \ldots, a_{t+1}\}$. Recall 
the definition of $F_{t+1}$, the feedback arriving at the end of slot 
t + 1, from Section II-C. Let $T_{t+1}$ denote the originating slots 
corresponding to feedback $F_{t+1}$, i.e., if the feedback from 
users $a_u$ and $a_v$, for $m \ge u > v \ge t + 1$, both arrive at slot 
t + 1, then $F_{t+1} = [f_u\ f_v]$ and $T_{t+1} = [u\ v]$. Also define 
$k_1 \in \{\emptyset, m, m-1, \ldots, t+1\}$ as the latest slot from which the 
ARQ feedback of user 1 is available at the scheduler by (the 
beginning of) slot t. Formally, if at least one ARQ feedback 
from user 1 has arrived at the scheduler by slot t, then 

$$k_1 = \min_{\substack{k \in \{m, m-1, \ldots, t+1\} \text{ s.t. } a_k = 1, \\ f_k \text{ has arrived by slot } t}} k. \tag{8}$$

If no ARQ feedback from user 1 has arrived by slot t, i.e., 
if there is no $k \in \{m, m-1, \ldots, t+1\}$ such that $a_k = 1$ and 
$f_k$ has arrived by slot t, then $k_1 = \emptyset$. Let $l_1 = k_1 - t - 1$, 
when $k_1 \ne \emptyset$, be a measure of 'freshness' of the latest 
feedback from user 1. Let $l_1 = \emptyset$ when $k_1 = \emptyset$. Similarly 
define $k_2, l_2$ for user 2. With these definitions, the proof 
proceeds in two steps: In step 1, we show that the greedy 
decision in slot t, given the ARQ feedback and the scheduling 
decision from slot $\min(k_1, k_2)$, is independent of the feedback 
and scheduling decision corresponding to slot $\max(k_1, k_2)$. In 
step 2, we show that, if the greedy policy is implemented 
in slot t, then the expected immediate reward in slot t is 
independent of the scheduling decisions $\mathbf{a}_{t+1}$. We then provide 
induction based arguments to establish the proposition. 

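Step 1: Let $\mathbf{F}_{t+1} := \{F_m, F_{m-1}, \ldots, F_{t+1}\}$ and $\mathbf{T}_{t+1} := 
\{T_m, T_{m-1}, \ldots, T_{t+1}\}$. The greedy decision in slot t, 
conditioned on the past feedback and scheduling decisions, is given 
by 

$$\hat{a}_t \mid \mathbf{F}_{t+1}, \mathbf{T}_{t+1}, \mathbf{a}_{t+1}, \pi_m \;=\; \hat{a}_t \mid f_{k_1}, f_{k_2}, l_1, l_2, \mathbf{a}_{t+1}, \pi_m. \tag{9}$$

The preceding equation comes directly from the first order 
Markovian property of the underlying channels. Consider the 
case when $k_1 < k_2 \le m$ ($\Rightarrow l_1 < l_2$), or $k_2 = \emptyset$ (including 
$k_1 = k_2 = \emptyset$, i.e., $l_1 = l_2 = \emptyset$). The belief values in slot t 
as a function of feedback $f_{k_1}$ and $f_{k_2}$ are given below: 

² We will discuss the steady state probability in Section IV. 

$$(\pi_t(1), \pi_t(2)) = \begin{cases} (T^{l_1}(p), T^{l_2}(p)), & \text{if } f_{k_1} = 1, f_{k_2} = 1 \\ (T^{l_1}(p), T^{l_2}(r)), & \text{if } f_{k_1} = 1, f_{k_2} = 0 \\ (T^{l_1}(p), T^{(m-t)}(\pi_m(2))), & \text{if } f_{k_1} = 1, k_2 = \emptyset \\ (T^{l_1}(r), T^{l_2}(p)), & \text{if } f_{k_1} = 0, f_{k_2} = 1 \\ (T^{l_1}(r), T^{l_2}(r)), & \text{if } f_{k_1} = 0, f_{k_2} = 0 \\ (T^{l_1}(r), T^{(m-t)}(\pi_m(2))), & \text{if } f_{k_1} = 0, k_2 = \emptyset \\ (T^{(m-t)}(\pi_m(1)), T^{(m-t)}(\pi_m(2))), & \text{if } k_1 = \emptyset, k_2 = \emptyset. \end{cases} \tag{10}$$

Using Lemma 1, the greedy decision can be written as 

$$\hat{a}_t \mid f_{k_1}, f_{k_2}, l_1, l_2, \mathbf{a}_{t+1}, \pi_m = \begin{cases} 1, & \text{if } f_{k_1} = 1 \\ 2, & \text{if } f_{k_1} = 0 \\ \arg\max_{i \in \{1,2\}} (\pi_m(i)), & \text{if } k_1 = \emptyset, k_2 = \emptyset. \end{cases} \tag{11}$$

Thus the greedy decision is independent of feedback $f_{k_2}$ if 
$k_1 < k_2$. We now proceed to generalize equation (11). Let $k^*$ 
denote the latest slot for which an ARQ feedback is available 
from one of the users by slot t, i.e., 

$$k^* = \begin{cases} \min\{k_1, k_2\}, & \text{if } k_1 \ne \emptyset, k_2 \ne \emptyset \\ k_1, & \text{if } k_1 \ne \emptyset, k_2 = \emptyset \\ k_2, & \text{if } k_1 = \emptyset, k_2 \ne \emptyset \\ \emptyset, & \text{if } k_1 = \emptyset, k_2 = \emptyset. \end{cases} \tag{12}$$

Let $l = k^* - t - 1$ for $k^* \ne \emptyset$ and $l = \emptyset$ for $k^* = \emptyset$ be a 
measure of freshness of the latest ARQ feedback. Thus, using 
the preceding discussion, we have 

$$\hat{a}_t \mid f_{k_1}, f_{k_2}, l_1, l_2, \mathbf{a}_{t+1}, \pi_m = \begin{cases} a_{k^*}, & \text{if } k^* \ne \emptyset, f_{k^*} = 1 \\ \bar{a}_{k^*}, & \text{if } k^* \ne \emptyset, f_{k^*} = 0 \\ \arg\max_{i \in \{1,2\}} (\pi_m(i)), & \text{if } k^* = \emptyset, \end{cases} \tag{13}$$

where $\bar{a}_{k^*}$ is the user not scheduled in slot $k^*$. This completes 
step 1 of the proof. 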
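
The conclusion of step 1 is that, for two users, the greedy decision follows a simple toggle rule: reschedule the user of the latest available feedback after an ACK, switch to the other user after a NACK. The Python sketch below simulates the two-user system and checks, slot by slot, that this rule agrees with the belief-based argmax. It is an illustration under stated assumptions: users are labeled 0 and 1 instead of 1 and 2, the delay distribution `[0.5, 0.3, 0.2]` and the initial belief `[0.6, 0.4]` are hypothetical values, and the function names are not from the paper.

```python
import random

def T_pow(x, u, p, r):
    """u-step belief evolution T^u(x) with T(x) = x*p + (1 - x)*r."""
    for _ in range(u):
        x = x * p + (1.0 - x) * r
    return x

def toggle_matches_greedy(m, p, r, delay_pmf, pi_m, seed):
    """Run greedy scheduling for m slots; return True if the toggle rule agrees in every slot."""
    rng = random.Random(seed)
    # Channel states of users 0 and 1 in the first slot (slot m).
    state = [rng.random() < pi_m[i] for i in range(2)]
    history = []  # entries: (slot k, scheduled user, feedback bit, arrival slot)
    for t in range(m, 0, -1):  # slots are indexed m, m-1, ..., 1
        # Feedback usable in slot t must have arrived by the end of slot t + 1.
        arrived = [h for h in history if h[3] >= t + 1]
        beliefs = []
        for i in range(2):
            own = [h for h in arrived if h[1] == i]
            if own:
                k, _, f, _ = min(own)  # smallest slot index = latest feedback
                beliefs.append(T_pow(p if f else r, k - t - 1, p, r))
            else:
                beliefs.append(T_pow(pi_m[i], m - t, p, r))
        greedy = 0 if beliefs[0] >= beliefs[1] else 1
        if arrived:
            k_star, a, f, _ = min(arrived)
            toggle = a if f else 1 - a   # ACK: same user, NACK: other user
        else:
            toggle = 0 if pi_m[0] >= pi_m[1] else 1
        if greedy != toggle:
            return False
        # Transmit to the greedy user; feedback from slot t arrives at end of slot t - d.
        f = int(state[greedy])
        d = rng.choices(range(len(delay_pmf)), weights=delay_pmf)[0]
        history.append((t, greedy, f, t - d))
        state = [rng.random() < (p if s else r) for s in state]
    return True

agreement = all(
    toggle_matches_greedy(12, 0.8700, 0.1083, [0.5, 0.3, 0.2], [0.6, 0.4], seed)
    for seed in range(200)
)
```

Note that the toggle rule never evaluates $T^u(\cdot)$ and never uses $p$, $r$ or the delay distribution, which is consistent with the robustness claim made for the greedy policy.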

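Step 2: If the greedy policy is implemented in slot t, the 
immediate reward expected in slot t, conditioned on the scheduling 
decisions $\mathbf{a}_{t+1}$ and the initial belief $\pi_m$, can be rewritten as 

$$\begin{aligned} E_{\pi_t \mid \mathbf{a}_{t+1}, \pi_m} R_t(\pi_t, \hat{a}_t) = {} & E_{\pi_t \mid l = \emptyset, \mathbf{a}_{t+1}, \pi_m}\left(R_t(\pi_t, \hat{a}_t)\right) P(l = \emptyset \mid \mathbf{a}_{t+1}, \pi_m) \\ & + E_{l, l \ne \emptyset \mid \mathbf{a}_{t+1}, \pi_m} E_{\pi_t \mid l, l \ne \emptyset, \mathbf{a}_{t+1}, \pi_m}\left(R_t(\pi_t, \hat{a}_t)\right), \end{aligned} \tag{14}$$

where l is defined after (12). Note that 

$$E_{\pi_t \mid l = \emptyset, \mathbf{a}_{t+1}, \pi_m}\left(R_t(\pi_t, \hat{a}_t)\right) = \max_{i \in \{1,2\}} T^{(m-t)}(\pi_m(i)), \tag{15}$$

since, with $l = \emptyset$, i.e., no past feedback at the scheduler, the 
belief values at slot t are independent of the past scheduling 
decisions and are simply given by $\pi_t = T^{(m-t)}(\pi_m)$. Now 
rewriting the second part of (14), 

$$\begin{aligned} & E_{l, l \ne \emptyset \mid \mathbf{a}_{t+1}, \pi_m} E_{\pi_t \mid l, l \ne \emptyset, \mathbf{a}_{t+1}, \pi_m}\left(R_t(\pi_t, \hat{a}_t)\right) \\ & \quad = E_{l, l \ne \emptyset \mid \mathbf{a}_{t+1}, \pi_m} E_{\pi_{l+t+1} \mid l, l \ne \emptyset, \mathbf{a}_{t+1}, \pi_m} E_{\pi_t \mid \pi_{l+t+1}, l, l \ne \emptyset, \mathbf{a}_{t+1}, \pi_m}\left(R_t(\pi_t, \hat{a}_t)\right). \end{aligned} \tag{16}$$

Consider $E_{\pi_t \mid \pi_{l+t+1}, l, l \ne \emptyset, \mathbf{a}_{t+1}, \pi_m}(R_t(\pi_t, \hat{a}_t))$. From the first 
step of the proof, the greedy decision in slot t can be made 
solely based on the latest feedback, i.e., $f_{k^* = l+t+1}$. This was 
recorded in (13). Thus, if the feedback $f_{k^*}$ is an ACK (occurs 
with probability $\pi_{l+t+1}(a_{l+t+1})$), reschedule the user $a_{l+t+1}$ 
in slot t. Conditioned on $f_{k^*} = 1$, the belief value $\pi_t(a_{l+t+1})$ 
and hence the expected immediate reward in slot t is given 
by $T^l(p)$. If the feedback is a NACK, schedule the other user, 
denoted by $\bar{a}_{l+t+1}$. Conditioned on $f_{k^*} = 0$, the belief value 
$\pi_t(\bar{a}_{l+t+1})$ and hence the expected immediate reward in slot t 
is given by $T^{(l+1)}(\pi_{l+t+1}(\bar{a}_{l+t+1})) = \pi_{l+t+1}(\bar{a}_{l+t+1}) T^l(p) + 
(1 - \pi_{l+t+1}(\bar{a}_{l+t+1})) T^l(r)$, where the equality holds because 
$T^l$ is affine. Averaging over $f_{k^* = l+t+1}$, we have 

$$\begin{aligned} & E_{\pi_t \mid \pi_{l+t+1}, l, l \ne \emptyset, \mathbf{a}_{t+1}, \pi_m}\left(R_t(\pi_t, \hat{a}_t)\right) \\ & \quad = \pi_{l+t+1}(a_{l+t+1}) T^l(p) + \left(1 - \pi_{l+t+1}(a_{l+t+1})\right) \times \left(\pi_{l+t+1}(\bar{a}_{l+t+1}) T^l(p) + (1 - \pi_{l+t+1}(\bar{a}_{l+t+1})) T^l(r)\right) \\ & \quad = P\left(\{S_{l+t+1}(1) = 1 \cup S_{l+t+1}(2) = 1\} \mid \pi_{l+t+1}, l, l \ne \emptyset, \mathbf{a}_{t+1}, \pi_m\right) T^l(p) \\ & \qquad + P\left(\{S_{l+t+1}(1) = 0 \cap S_{l+t+1}(2) = 0\} \mid \pi_{l+t+1}, l, l \ne \emptyset, \mathbf{a}_{t+1}, \pi_m\right) T^l(r), \end{aligned} \tag{17}$$

where $S_k(i)$ is the 1/0 state of the channel of user i in slot 
k. From (16), 

$$\begin{aligned} & E_{l, l \ne \emptyset \mid \mathbf{a}_{t+1}, \pi_m} E_{\pi_t \mid l, l \ne \emptyset, \mathbf{a}_{t+1}, \pi_m}\left(R_t(\pi_t, \hat{a}_t)\right) \\ & \quad = E_{l, l \ne \emptyset \mid \mathbf{a}_{t+1}, \pi_m} E_{\pi_{l+t+1} \mid l, l \ne \emptyset, \mathbf{a}_{t+1}, \pi_m} \Big( P\left(\{S_{l+t+1}(1) = 1 \cup S_{l+t+1}(2) = 1\} \mid \pi_{l+t+1}, l, l \ne \emptyset, \mathbf{a}_{t+1}, \pi_m\right) T^l(p) \\ & \qquad + P\left(\{S_{l+t+1}(1) = 0 \cap S_{l+t+1}(2) = 0\} \mid \pi_{l+t+1}, l, l \ne \emptyset, \mathbf{a}_{t+1}, \pi_m\right) T^l(r) \Big) \\ & \quad = E_{l, l \ne \emptyset \mid \mathbf{a}_{t+1}, \pi_m} \Big( P\left(\{S_{l+t+1}(1) = 1 \cup S_{l+t+1}(2) = 1\} \mid l, l \ne \emptyset, \mathbf{a}_{t+1}, \pi_m\right) T^l(p) \\ & \qquad + P\left(\{S_{l+t+1}(1) = 0 \cap S_{l+t+1}(2) = 0\} \mid l, l \ne \emptyset, \mathbf{a}_{t+1}, \pi_m\right) T^l(r) \Big) \\ & \quad = E_{l, l \ne \emptyset \mid \mathbf{a}_{t+1}, \pi_m} \Big( P\left(\{S_{l+t+1}(1) = 1 \cup S_{l+t+1}(2) = 1\} \mid \pi_m\right) T^l(p) \\ & \qquad + P\left(\{S_{l+t+1}(1) = 0 \cap S_{l+t+1}(2) = 0\} \mid \pi_m\right) T^l(r) \Big). \end{aligned} \tag{18}$$

We have used the following argument in the last equality: 
the event $\{S_{l+t+1}(1) = 1 \cup S_{l+t+1}(2) = 1\}$ is controlled 
by the underlying Markov dynamics and is independent of the 
scheduling decisions $\mathbf{a}_{t+1}$. Likewise, this event is independent 
of the value of l since we have assumed that the feedback 
channel and the forward channel are independent. 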

Recall $D(i, k)$ is the random variable indicating the delay 
incurred by the ARQ feedback sent by user i in slot k. Let 
L be the random variable corresponding to the quantity l, the 
degree of freshness of the latest ARQ feedback, and $P_L(\cdot)$ be 
the probability mass function of L. Therefore, for $0 \le l \le 
m - t - 1$, 

$$\begin{aligned} P_L(l \mid \mathbf{a}_{t+1}, \pi_m) & = P\big(\{D(a_{l+t+1}, l+t+1) \le l,\ D(a_{l+t}, l+t) > (l-1),\ D(a_{l+t-1}, l+t-1) > (l-2), \ldots, \\ & \qquad\quad D(a_{t+1}, t+1) > 0\} \mid \mathbf{a}_{t+1}, \pi_m\big) \\ & = P\big(\{D(a_{l+t+1}, l+t+1) \le l,\ D(a_{l+t}, l+t) > (l-1),\ D(a_{l+t-1}, l+t-1) > (l-2), \ldots, \\ & \qquad\quad D(a_{t+1}, t+1) > 0\} \mid \mathbf{a}_{t+1}\big) \\ & = P(D(1, l+t+1) \le l) \prod_{k=t+l}^{t+1} P(D(1, k) > k - t - 1), \end{aligned} \tag{19}$$

where we have used the independence between the forward 
and the feedback channel to remove the condition on $\pi_m$ in the 
second equality. The last equality comes from the assumption 
that the ARQ delay is i.i.d. across users and time.³ Similarly 

$$P_L(l = \emptyset \mid \mathbf{a}_{t+1}, \pi_m) = \prod_{k=m}^{t+1} P(D(a_k, k) > k - t - 1). \tag{20}$$

Applying the preceding equations in (14), we have 

$$\begin{aligned} E_{\pi_t \mid \mathbf{a}_{t+1}, \pi_m} R_t(\pi_t, \hat{a}_t) = {} & \prod_{k=m}^{t+1} P(D(a_k, k) > k - t - 1) \max_{i \in \{1,2\}} T^{(m-t)}(\pi_m(i)) \\ & + \sum_{l=0}^{m-t-1} P(D(1, l+t+1) \le l) \prod_{k=t+l}^{t+1} P(D(1, k) > k - t - 1) \\ & \quad \times \Big( P\left(\{S_{l+t+1}(1) = 1 \cup S_{l+t+1}(2) = 1\} \mid \pi_m\right) T^l(p) \\ & \qquad + P\left(\{S_{l+t+1}(1) = 0 \cap S_{l+t+1}(2) = 0\} \mid \pi_m\right) T^l(r) \Big). \end{aligned} \tag{21}$$

The expected reward in slot t is thus independent of the 
sequence of actions $\{a_m, a_{m-1}, \ldots, a_{t+1}\}$ if the greedy policy 
is implemented in slot t. By extension, the total reward 
expected from slot t until the horizon is independent of the 
scheduling vector $\mathbf{a}_{t+1}$ if the greedy policy is implemented in 
slots $\{t, t-1, \ldots, 1\}$, i.e., 

$$\sum_{k=t}^{1} E_{\pi_k \mid \mathbf{a}_{t+1}, \pi_m} R_k(\pi_k, \hat{a}_k) = \sum_{k=t}^{1} E_{\pi_k \mid \pi_m} R_k(\pi_k, \hat{a}_k). \tag{22}$$

³ Note: here we do not require the ARQ delay to be identically distributed 
across time. 

Thus, if the greedy policy is optimal in slots $\{t, t-1, \ldots, 1\}$, 
then it is also optimal in slot t + 1. Since t is arbitrary and 
since the greedy policy is optimal at the horizon, by induction, 
the greedy policy is optimal in every slot $\{m, m-1, \ldots, 1\}$. 
This establishes the proposition. ■ 
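
The freshness distribution used in step 2 of the proof, equations (19) and (20), is easy to evaluate for a given delay distribution. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, the delay pmf is passed as a list indexed by d = 0, 1, ..., and the example pmf `[0.5, 0.3, 0.2]` is not taken from the paper.

```python
def tail(delay_pmf, j):
    """P(D > j) for a delay pmf given as a list indexed by d = 0, 1, ... (j >= 0)."""
    return sum(delay_pmf[j + 1:])

def freshness_distribution(delay_pmf, m, t):
    """Return ({l: P_L(l)}, P_L(no feedback)) for slot t of an m-slot horizon.

    P_L(l) = P(D <= l) * prod_{k=t+1}^{l+t} P(D > k - t - 1), as in (19);
    P_L(no feedback) = prod_{k=t+1}^{m} P(D > k - t - 1), as in (20).
    """
    probs = {}
    for l in range(0, m - t):                     # l = 0, ..., m - t - 1
        p_arrived = 1.0 - tail(delay_pmf, l)      # feedback from slot l+t+1 is in
        p_newer_missing = 1.0
        for k in range(t + 1, l + t + 1):         # newer slots k = t+1, ..., l+t
            p_newer_missing *= tail(delay_pmf, k - t - 1)
        probs[l] = p_arrived * p_newer_missing
    p_none = 1.0
    for k in range(t + 1, m + 1):                 # all past slots k = t+1, ..., m
        p_none *= tail(delay_pmf, k - t - 1)
    return probs, p_none

probs, p_none = freshness_distribution([0.5, 0.3, 0.2], m=7, t=4)
total = sum(probs.values()) + p_none
```

The events "latest feedback has freshness l" and "no feedback has arrived" partition the sample space, so `total` equals one up to rounding. None of the quantities depends on which users were scheduled, which is the property the proof relies on.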

Remarks: When the Markov channels are negatively 
correlated, i.e., p < r (a case of limited practical significance), 
using arguments similar to those in the preceding proof, we 
can show that the greedy policy is optimal when $N = 2$, for 
any ARQ delay distribution. We record this below. 

Corollary 3: When the Markov channels are negatively 
correlated, i.e., p < r, and when $N = 2$, the sum throughput, 
$\eta_{\text{sum}}(\pi_m, \{\mathfrak{A}_k\}_{k=1}^{m})$, of the system is maximized by the greedy 
policy $\{\hat{\mathfrak{A}}_k\}_{k=1}^{m}$ for any ARQ delay distribution. 

A formal proof can be found in Appendix B. 

Returning to the original positive correlation setup, the 
arguments in the proof of Proposition 2 hold true even when 
the ARQ delay is not identically distributed across time. Thus, 
the greedy policy is optimal for $N = 2$ even when the ARQ 
delay distribution is time-variant. Also, since m is arbitrary, the 
greedy policy maximizes the sum throughput over an infinite 
horizon. We record this below. 

Corollary 4: For $N = 2$, the greedy policy is optimal when 
the performance metric is the sum throughput over an infinite 
horizon, i.e., 

$$\{\hat{\mathfrak{A}}_k\}_{k \ge 1} = \arg\max_{\{\mathfrak{A}_k\}_{k \ge 1}} \lim_{m \to \infty} \frac{V_m(\pi, \{\mathfrak{A}_k\}_{k \ge 1})}{m} \tag{23}$$

for any initial belief $\pi$. 

The optimality of the greedy policy does not extend to the 
case N > 2. We record this in the following proposition. 

Proposition 5: The greedy policy is not, in general, optimal 
when there are more than two users in the downlink. 

Proof outline: We establish the proposition using a 
counterexample with deterministic ARQ delay of $D = 1$, i.e., 
$P_D(d = 1) = 1$, and arbitrary values of $N$, $N > 2$, and 
$m$, $m > 3$. We construct a variant of the greedy policy that 
schedules a non-greedy user in a specific time slot under a 
specific sample path of the past channel states observable by 
the scheduler. In the rest of the slots and under other 
realizations, the constructed policy performs greedy scheduling. We 
explicitly evaluate the difference in the rewards corresponding 
to the constructed policy and the greedy policy and show that 
there exist system parameters such that the constructed policy 
has a reward strictly larger than the greedy policy. Thus the 
greedy policy is, in general, not optimal when $N > 2$. A 
formal proof can be found in Appendix C. 

Remarks: Note that, in contrast, it has been shown in [21] 
that the greedy policy is optimal for any number of users 
when the ARQ feedback is instantaneous, i.e., $D = 0$. To 
summarize, the optimality of the greedy policy vanishes 

• when the ARQ delay is increased from zero to higher 
values, with the number of users unconstrained, or 

• when the number of users is increased from two to 
higher values, with the ARQ delay being random and 
unconstrained. 

These observations point to the volatile nature of the underlying 
dynamics of the scheduling problem, with respect to the 
greedy policy optimality. 

It would be interesting to see how the optimality properties 
of the greedy policy extend to more general channel models. 
Considering the multi-rate channels, i.e., when the number of 
states is greater than two, the special 'toggle' structure that led 
to the optimality of the greedy policy in the ON-OFF channel 
vanishes. In fact, we have shown [30] that, even when the
number of states is increased by 1, the general greedy policy
optimality vanishes and the optimality can be shown to hold
only under very restrictive conditions on the Markov channel
statistics. Now, consider the case when the two-state Markov
channels are non-identical across users. In this setup, we can
show that the greedy policy is not, in general, optimal, even
when the ARQ feedback is instantaneous. We record this below.

Proposition 6: The greedy policy is not, in general, optimal
when the Markov channels are not identical across users, even
when $N = 2$ and the ARQ feedback is instantaneous.

The proposition is established using counterexamples. The proof
is available in Appendix D.

In summary, continuing our discussion before Proposition 6,
the optimality of the greedy policy vanishes even under
minimal deviations from the original setup. These observations
further indicate the volatile nature of the underlying scheduling
problem dynamics.

Returning to the original setup at hand, numerical results 
suggest that the greedy policy, despite being not optimal in 
general, has near optimal performance. We discuss this next. 

B. Performance Evaluation of the Greedy Policy 

Table I provides a sample of the net expected reward under
the greedy policy ($V_{\text{greedy}}$) in comparison with that of the
optimal policy ($V_{\text{opt}}$) when $N = 3$ and when $N = 4$, for
horizon length $m = 7$. The ARQ delay probability mass
function is generated uniformly at random, with the maximum
delay $d_{\max}$ fixed first. The low values of the quantity
%subopt $= \frac{V_{\text{opt}} - V_{\text{greedy}}}{V_{\text{opt}}} \times 100\%$ illustrate the near optimal
performance of the greedy policy for the system parameters
considered. Note that the optimal reward, $V_{\text{opt}}$, is evaluated
by a brute-force search over the scheduling decisions in every
slot $t \in \{m, m - 1, \ldots, 1\}$, which is prohibitively complex for
larger values of $N$ and $m$. We therefore perform an indirect
study of the greedy policy performance in Tables II-IV, which
allows us to consider a wider range of system parameters. We
first define the genie-aided system as follows: for any slot
$k$, the feedback $f_k$ includes the channel state information,
corresponding to slot $k$, of not only the scheduled user but
of all the users in the system. Thus the optimal reward
in the genie-aided system, $V_{\text{genie}}$, is an upper bound to the
optimal reward in the original system, $V_{\text{opt}}$. Also, $V_{\text{genie}}$ can
be evaluated using closed-form expressions, with complexity
much lower than that of $V_{\text{opt}}$. We will discuss the evaluation
of $V_{\text{genie}}$ in the context of the genie-aided system sum capacity
in Section IV-A.

In Table II, with the maximum ARQ delay $d_{\max} = 1$,
the net expected reward under the greedy policy is compared
with $V_{\text{genie}}$ when $N = 10$ and when $N = 20$, for randomly
generated values of $p$ and $r$. The length of the horizon is
fixed at $m = 10$. The probability mass function of the
ARQ delay, denoted by 'Delay' in the table, is controlled
to have a weakening 'tail' from [0 1] to [2/3 1/3]. The quantity
%genie $= \frac{V_{\text{genie}} - V_{\text{greedy}}}{V_{\text{genie}}} \times 100\%$ is an upper bound to the
quantity %subopt introduced earlier. Table III is similarly
constructed with the maximum ARQ delay $d_{\max} = 2$. In both
Tables II and III we see that %genie is predominantly low-valued,
suggesting that the greedy policy has near optimal
performance. Also note that, as the tail of the ARQ delay
mass function weakens, both $V_{\text{genie}}$ and $V_{\text{greedy}}$ increase. This
is expected since, with a weakening tail, the ARQ feedback is
stochastically more 'fresh', thereby facilitating better informed
scheduling decisions and higher rewards in both the genie-aided
and the original systems. Also note that, as the tail weakens, the
gap between the optimal rewards in the genie-aided system
and the original system can be expected to increase, since
the gap between the information content of the full feedback
(genie-aided system) and the ARQ feedback increases with
a weakening tail. Thus, the relatively high values of %genie
corresponding to weaker delay tails could be due to an
inherent system level gap between the genie-aided and the
original systems, and need not necessarily reflect the
greedy policy performance. The last statement is further
strengthened by the fact that the greedy policy is optimal when
the ARQ delay tail is at its weakest, i.e., when the feedback
is instantaneous.

In Table IV we study the effect of the Markov channel
memory, defined as $(p - r)$, on the reward functions. With
$d_{\max} = 2$, $m = 10$ and $N = 20$, we consider two extreme
values of the channel memory, i.e., $(p - r) = 0.2$ and
$(p - r) = 0.8$. In both cases of channel memory, we have fixed
the steady state probability of the ON state to be $\pi_{ss} = 0.5$,
by fixing $p + r = 1$. This essentially provides a degree of
fairness when comparing these two cases. Note that, for a
fixed delay statistic, the rewards $V_{\text{genie}}$ and $V_{\text{greedy}}$ increase
with an increase in the channel memory. This is due to an increase
in the value of the feedback as the channel memory increases.
Also, we see an increase in the value of %genie as the memory
increases. This points to two underlying phenomena: 1) an
increase in the inherent sub-optimality associated with greedy
scheduling as the channel memory increases; and 2) similar to
the case of a weakening delay tail, an increase in the channel
memory results in an increase in the system level gap between
the genie-aided and the original systems, by way of an increase
in the gap between the information content of full feedback
(genie-aided system) and the ARQ feedback.

Summarizing, Tables I-IV suggest that the greedy policy
has near optimal performance for a wide range of system
parameters and that the ARQ delay profile and the channel
memory affect the reward values in ways that can be explained
intuitively. In addition, note that %genie is also an upper bound
to the quantity $\frac{V_{\text{genie}} - V_{\text{opt}}}{V_{\text{genie}}} \times 100\%$. Thus the low values of
%genie provide the following larger message: using only the
1-bit ARQ feedback for opportunistic scheduling is associated
with system level performance comparable to the case when



N   Delay = [P_D(0) ... P_D(d_max)]   p        r        V_opt    V_greedy   %subopt
3   [0.8822 0.1178]                   0.9172   0.2858   6.0707   6.0696     0.0182 %
4   [0.5387 0.4613]                   0.9464   0.1666   5.9700   5.9586     0.1910 %
3   [0.5908 0.3959 0.0132]            0.6619   0.2389   3.9933   3.9914     0.0476 %
4   [0.6647 0.1844 0.1510]            0.9281   0.2824   5.8934   5.8854     0.1364 %

TABLE I

Comparison of the performance of the greedy policy with the optimal reward.



N = 10       p = [illegible], r = [illegible]      p = [illegible], r = [illegible]
Delay        V_genie   V_greedy   %genie           V_genie   V_greedy   %genie
[0 1]        5.3908    5.2912     1.8470 %         5.5279    5.2067     5.8109 %
[1/3 2/3]    5.6547    5.4281     4.0072 %         5.9195    5.4119     8.5741 %
[1/2 1/2]    5.7867    5.4987     4.9771 %         6.1152    5.5208     9.7203 %
[2/3 1/3]    5.9187    5.5712     5.8703 %         6.3110    5.6353     10.7070 %

N = 20       p = 0.9148, r = 0.4309                p = 0.3079, r = 0.2517
Delay        V_genie   V_greedy   %genie           V_genie   V_greedy   %genie
[0 1]        8.8565    8.8254     0.3504 %         3.4487    3.4371     0.3368 %
[1/3 2/3]    8.9715    8.9291     0.4723 %         3.5525    3.4661     2.4315 %
[1/2 1/2]    9.0290    8.9820     0.5203 %         3.6043    3.4807     3.4300 %
[2/3 1/3]    9.0865    9.0357     0.5593 %         3.6562    3.4955     4.3967 %

TABLE II

Comparison of the performance of the greedy policy with the optimal reward in the genie-aided system. Maximum ARQ delay, d_max = 1.



N = 10           p = 0.2148, r = 0.1100                p = 0.6863, r = 0.4136
Delay            V_genie   V_greedy   %genie           V_genie   V_greedy   %genie
[0 1]            2.0196    2.0162     0.1716 %         6.2768    6.2571     0.3131 %
[1/6 1/3 1/2]    2.1261    2.0384     4.1241 %         6.4895    6.3813     1.6663 %
[1/3 1/3 1/3]    2.2152    2.0577     7.1089 %         6.6375    6.4743     2.4587 %
[1/2 1/3 1/6]    2.3018    2.0772     9.7568 %         6.7764    6.5677     3.0792 %

N = 20           p = 0.8822, r = 0.2816                p = 0.7120, r = 0.5713
Delay            V_genie   V_greedy   %genie           V_genie   V_greedy   %genie
[0 1]            8.0485    7.9811     0.8376 %         7.0084    7.0066     0.0251 %
[1/6 1/3 1/2]    8.3208    8.1880     1.5952 %         7.0868    7.0585     0.3989 %
[1/3 1/3 1/3]    8.4754    8.3186     1.8493 %         7.1495    7.1017     0.6675 %
[1/2 1/3 1/6]    8.6131    8.4490     1.9057 %         7.2099    7.1448     0.9018 %

TABLE III

Comparison of the performance of the greedy policy with the optimal reward in the genie-aided system. Maximum ARQ delay, d_max = 2.
(p - r) = 0.2                                          (p - r) = 0.8
Delay            V_genie   V_greedy   %genie           Delay            V_genie   V_greedy   %genie
[0 1]            5.6342    5.6232     0.1953 %         [0 1]            7.9848    7.7252     3.2520 %
[1/6 1/3 1/2]    5.8068    5.7105     1.6592 %         [1/6 1/3 1/2]    8.3585    8.0181     4.0726 %
[1/3 1/3 1/3]    5.9357    5.7797     2.6283 %         [1/3 1/3 1/3]    8.5551    8.1843     4.3347 %
[1/2 1/3 1/6]    6.0584    5.8494     3.4499 %         [1/2 1/3 1/6]    8.7265    8.3522     4.2890 %

TABLE IV

Illustration of the effect of the Markov channel memory, (p - r), on the reward functions. Maximum ARQ delay, d_max = 2.



feedback is available from all the users. 

C. Structure of the Greedy Policy 

Motivated by the near optimal performance of the greedy 
policy, we proceed to study its structure, which turns out to 
be very amenable for practical implementation. We begin by 
defining the following quantity: 

Schedule order vector, $O_t$, in slot $t$: The user indices in
decreasing order of $\pi_t(i)$, i.e.,

$$O_t(1) = \arg\max_i \pi_t(i), \quad \ldots, \quad O_t(N) = \arg\min_i \pi_t(i).$$

Thus, the greedy decision in slot $t$ is $a_t = O_t(1)$.

Now, in any slot $t < m$, any user $i$ falls under one of the
following two cases:

1) The scheduler has received at least one ARQ feedback
from user $i$ by the beginning of slot $t$. Let $k_i$, for
$m \ge k_i > t$, be the latest slot for which the ARQ feedback
from user $i$ is available at the scheduler. Since the channel
is first-order Markovian, the belief value of the channel
of user $i$ in the current slot $t$ depends only on the
feedback $f_{k_i}$ and on $k_i$. The belief value is given by

$$\pi_t(i) = \begin{cases} T^{k_i - t - 1}(p) & \text{if } f_{k_i} = 1 \\ T^{k_i - t - 1}(r) & \text{if } f_{k_i} = 0. \end{cases} \qquad (24)$$

2) The scheduler does not have any ARQ feedback from
user $i$ by the beginning of slot $t$. In this case

$$\pi_t(i) = T^{m - t}(\pi_m(i)). \qquad (25)$$

Recall that $\pi_m(i)$ is the initial belief value of the channel
of user $i$ when the scheduling process started at slot $m$.
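As a rough illustration of (24) and (25), the sketch below computes the belief value of one user from its latest feedback. It assumes the one-step belief update $T(x) = xp + (1 - x)r$, i.e., the usual update for a two-state Markov chain with $P(\text{ON} \to \text{ON}) = p$ and $P(\text{OFF} \to \text{ON}) = r$; the function names are hypothetical and introduced only for this example.

```python
def T(x, p, r):
    """One-step belief update (assumed form): P(ON next slot) when P(ON now) = x."""
    return x * p + (1.0 - x) * r


def belief_with_feedback(t, k_i, ack, p, r):
    """Belief in slot t when the latest feedback originated in slot k_i > t, as in (24).

    Slots are indexed in decreasing order (m, m - 1, ..., 1), so k_i > t
    means the feedback is older than the current slot.
    """
    if k_i <= t:
        raise ValueError("k_i must be larger than t because slots count down")
    x = p if ack else r              # T^0(p) after an ACK, T^0(r) after a NACK
    for _ in range(k_i - t - 1):     # apply T a total of (k_i - t - 1) times
        x = T(x, p, r)
    return x


def belief_without_feedback(t, m, initial_belief, p, r):
    """Belief in slot t when no feedback has arrived yet, as in (25)."""
    x = initial_belief
    for _ in range(m - t):
        x = T(x, p, r)
    return x
```

With $p > r$, an ACK-based belief decays toward the steady state value as the feedback ages, while a NACK-based belief grows toward it, which is the ordering that the queue construction relies on.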
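At slot $t$, let $\mathcal{A}_t$ denote the set of users, $i$, whose latest
feedback, $f_{k_i}$, is an ACK. Let $\mathcal{N}_t$ denote the set of users,
$j$, whose latest feedback, $f_{k_j}$, is a NACK. Let the users
from whom the scheduler has not yet received any feedback
constitute the set $\mathcal{X}_t$. From (24) and (25), using Lemma 1, the
greedy decision in slot $t$ can be written as

$$a_t = \begin{cases} \arg\min_{i \in \mathcal{A}_t} k_i & \text{if } \mathcal{A}_t \neq \emptyset \\ \arg\max_{i \in \mathcal{X}_t} \pi_m(i) & \text{if } \mathcal{A}_t = \emptyset \text{ and } \mathcal{X}_t \neq \emptyset \\ \arg\max_{i \in \mathcal{N}_t} k_i & \text{if } \mathcal{A}_t = \emptyset \text{ and } \mathcal{X}_t = \emptyset. \end{cases} \qquad (26)$$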



Now, for ease of implementation, we visualize the sets $\mathcal{A}_t$,
$\mathcal{X}_t$ and $\mathcal{N}_t$ as queues with elements ordered in the following
specific ways: Let $\mathcal{A}_t(i)$ denote the $i$-th element of queue $\mathcal{A}_t$
and let the elements be ordered such that
$k_{\mathcal{A}_t(1)} \le k_{\mathcal{A}_t(2)} \le \cdots \le k_{\mathcal{A}_t(n(\mathcal{A}_t))}$,
where $n(\mathcal{A})$ denotes the cardinality of set $\mathcal{A}$. Note
that the user that gave an ACK from the most recent slot lies
at the head of queue $\mathcal{A}_t$. The elements of $\mathcal{X}_t$ are ordered such
that $\pi_m(\mathcal{X}_t(1)) \ge \pi_m(\mathcal{X}_t(2)) \ge \cdots \ge \pi_m(\mathcal{X}_t(n(\mathcal{X}_t)))$. The
elements of $\mathcal{N}_t$ satisfy $k_{\mathcal{N}_t(1)} \ge k_{\mathcal{N}_t(2)} \ge \cdots \ge k_{\mathcal{N}_t(n(\mathcal{N}_t))}$, i.e.,
the user with the oldest NACK feedback lies on top of queue
$\mathcal{N}_t$. Define a combined queue constructed by concatenating the
queues $\mathcal{A}_t$, $\mathcal{X}_t$ and $\mathcal{N}_t$ in that order. From (24) and (25), using
Lemma 1, we see that the users in the combined queue are
arranged in decreasing order (top-down) of belief values, with
the top-most user being the greedy decision in slot $t$. Thus the
combined queue is, in fact, the schedule order vector $O_t$.
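The queue construction above can be sketched in a few lines. The code below is only an illustration under stated assumptions: the `UserRecord` container and the function name are hypothetical, slots are indexed in decreasing order (so a smaller $k_i$ means fresher feedback), and ties are broken by input order.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class UserRecord:
    user: int
    latest_slot: Optional[int]    # k_i, or None if no feedback has arrived yet
    ack: Optional[bool]           # latest feedback: True = ACK, False = NACK
    initial_belief: float         # pi_m(i)


def schedule_order(records: List[UserRecord]) -> List[int]:
    """Concatenate the ACK, no-feedback, and NACK queues, in that order."""
    acked = [u for u in records if u.latest_slot is not None and u.ack]
    silent = [u for u in records if u.latest_slot is None]
    nacked = [u for u in records if u.latest_slot is not None and not u.ack]

    acked.sort(key=lambda u: u.latest_slot)        # most recent ACK (smallest k) first
    silent.sort(key=lambda u: -u.initial_belief)   # largest initial belief first
    nacked.sort(key=lambda u: -u.latest_slot)      # oldest NACK (largest k) first
    return [u.user for u in acked + silent + nacked]


records = [
    UserRecord(1, 5, True, 0.5),
    UserRecord(2, 7, True, 0.5),
    UserRecord(3, None, None, 0.6),
    UserRecord(4, None, None, 0.4),
    UserRecord(5, 6, False, 0.5),
    UserRecord(6, 9, False, 0.5),
]
order = schedule_order(records)
greedy_user = order[0]
```

Note that no channel statistics appear anywhere in the sketch; only the feedback bits and their originating slots are used.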

We now discuss the evolution of the schedule order vector.
For every user $a$ whose ARQ feedback is contained in $F_t$,
implement the following procedure: Let $t_a$ indicate the
originating slot for the ARQ feedback from user $a$ contained in
$F_t$. Now, if $t_a$ is the latest slot from which the ARQ feedback
of user $a$ is available at the scheduler, then $k_a = t_a$. The
new schedule order vector $O_{t-1}$ is formed by removing user
$a$ from its current position (in $O_t$) and placing it in the
sub-queue $\mathcal{A}_{t-1}$ (if $f_{k_a} = 1$) or in the sub-queue $\mathcal{N}_{t-1}$ (if $f_{k_a} = 0$)
at an appropriate location (so that the ordering based on $k_i$ is
not violated). If $t_a \neq k_a$, i.e., $t_a$ is not the latest slot, then user
$a$ is not moved. Similarly, users whose ARQ feedback is not
contained in $F_t$ are not moved. The last two statements are
direct consequences of the following facts:

• For a user $a$ whose ARQ feedback is contained in $F_t$
but is not the latest feedback from that user, the belief
value evolves as $\pi_{t-1}(a) = T(\pi_t(a))$. Similarly, for a
user $b$ whose ARQ feedback is not contained in $F_t$, the
belief value evolves as $\pi_{t-1}(b) = T(\pi_t(b))$. Both these
cases were discussed in Section II-C.

• From Lemma 1, if $x \ge y$, then $T(x) \ge T(y)$.

Now, at slot $t - 1$, the user on top of $O_{t-1}$ is the greedy
decision. Thus the greedy decision in any slot is determined
by the latest ARQ feedback and the corresponding originating
slot index of all the users in the system. Note that this
implementation does not require the Markov channel statistics
(other than the knowledge that $p > r$) or the statistics of
the ARQ feedback delay. An illustration of the greedy policy
implementation is provided in Fig. 2.

For the special case of deterministic ARQ feedback delay 
Fig. 2. Greedy policy implementation under random ARQ delay.



$D = d$, the evolution from $O_t$ to $O_{t-1}$ is greatly simplified as
follows. At the end of slot $t$, since $D = d$, $F_t$ contains feedback
only from the user scheduled in slot $t + d$, i.e., user $a_{t+d}$.
Thus $F_t = f_{t+d}$. The feedback bits $f_m, f_{m-1}, \ldots, f_{t+d+1}$
from users $a_m, a_{m-1}, \ldots, a_{t+d+1}$ have already arrived at the
end of slots $m - d, m - 1 - d, \ldots, t + 1$, and the feedback from
users $a_{t+d-1}, a_{t+d-2}, \ldots$ is yet to arrive. Thus $F_t = f_{t+d}$
from user $a_{t+d}$ is the latest feedback available from any
user. Thus, recalling the ordering rules for $\mathcal{A}_{t-1}$ and $\mathcal{N}_{t-1}$,
if $F_t = 1$, user $a_{t+d}$ is removed from its current position
and placed on top in the updated schedule order vector, i.e.,
$O_{t-1} = [a_{t+d} \ \ O_t - a_{t+d}]$ (user $a_{t+d}$ becomes the greedy
decision in slot $t - 1$). If $F_t = 0$, $a_{t+d}$ is placed at the
bottom, i.e., $O_{t-1} = [O_t - a_{t+d} \ \ a_{t+d}]$. When there is no
ARQ delay ($D = d = 0$), the implementation becomes even
simpler: on receiving an ACK, $O_{t-1} = O_t$, and on a NACK,
$O_{t-1} = [O_t - O_t(1) \ \ O_t(1)]$, since $a_{t+d} = a_t = O_t(1)$. This
results in a simple round robin implementation of the greedy
policy, as discussed in [21], [27]. Fig. 3 and Fig. 4 illustrate the
greedy policy implementation in the deterministically delayed
ARQ and instantaneous ARQ systems, respectively.
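The deterministic-delay update rule is simple enough to state as code. The following is a minimal sketch, assuming the schedule order vector is held as a Python list with the greedy choice at index 0; the function name is hypothetical.

```python
from typing import List


def update_schedule_order(order: List[int], user: int, ack: bool) -> List[int]:
    """Return O_{t-1} given O_t and the feedback bit just received from `user`.

    ACK: the user moves to the top.  NACK: the user moves to the bottom.
    All other users keep their relative order.
    """
    rest = [u for u in order if u != user]
    return [user] + rest if ack else rest + [user]


# Delayed feedback from user 2 while the current order is [1, 2, 3].
after_ack = update_schedule_order([1, 2, 3], 2, ack=True)
after_nack = update_schedule_order([1, 2, 3], 2, ack=False)

# Zero delay: the feedback always comes from the user on top, so an ACK
# leaves the order unchanged and a NACK rotates it (round robin).
zero_delay_ack = update_schedule_order([1, 2, 3], 1, ack=True)
zero_delay_nack = update_schedule_order([1, 2, 3], 1, ack=False)
```

The update uses only the identity of the reporting user and one feedback bit, which is why the implementation needs neither the channel parameters nor the delay statistics.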



IV. On Downlink Sum Capacity and Capacity 
Region 

We now proceed to study the fundamental limits on the 
downlink system performance — the sum capacity and the 
capacity region. 



Fig. 3. Greedy policy implementation under deterministically delayed ARQ,
i.e., $D = d$.

*If $Z = [z_1 \ z_2 \ z_3]$ then $Z - z_2 := [z_1 \ z_3]$ and hence $[z_2 \ \ Z - z_2] = [z_2 \ z_1 \ z_3]$.

Fig. 4. Greedy policy implementation under instantaneous (end of slot) ARQ,
i.e., $D = 0$.



A. Sum Capacity of the Downlink 



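The sum capacity of the downlink is defined as the maximum
sum throughput over an infinite horizon with steady state
initial conditions. Formally, with $N$ users in the system,

$$C_{\text{sum}}(N) = \max_{\{\mathfrak{A}_k\}_{k \ge 1}} \lim_{m \to \infty} \frac{V_m(\pi_{ss}, \{\mathfrak{A}_k\}_{k \ge 1})}{m} \qquad (27)$$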

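where $\forall i \in \{1, \ldots, N\}$, $\pi_{ss}(i) = p_s$, the steady state
probability of the Markov channel. We now proceed to derive
$p_s$. The Markov chain transition matrix

$$P = \begin{bmatrix} p & 1 - p \\ r & 1 - r \end{bmatrix}$$

can be expressed as $P = U \Lambda V$, where

$$U = \begin{bmatrix} 1 & 1 \\ 1 & -\frac{r}{1-p} \end{bmatrix}, \quad
\Lambda = \begin{bmatrix} 1 & 0 \\ 0 & p - r \end{bmatrix}, \quad
V = \frac{1-p}{1-(p-r)} \begin{bmatrix} \frac{r}{1-p} & 1 \\ 1 & -1 \end{bmatrix},$$

with $VU = I$. Assuming $p + (1 - r) < 2$,

$$\lim_{n \to \infty} P^n = \begin{bmatrix} \frac{r}{1-(p-r)} & 1 - \frac{r}{1-(p-r)} \\ \frac{r}{1-(p-r)} & 1 - \frac{r}{1-(p-r)} \end{bmatrix}, \qquad \text{so that} \qquad p_s = \frac{r}{1 - (p - r)}.$$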
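The closed form for $p_s$ can be checked numerically against repeated multiplication of the transition matrix. The sketch below is only a sanity check with hypothetical function names; it assumes the state ordering (ON, OFF) with first row $[p, 1 - p]$ and second row $[r, 1 - r]$.

```python
def steady_state_on_probability(p, r):
    """Closed-form steady state probability of the ON state."""
    if p == 1.0 and r == 0.0:
        raise ValueError("P is the identity matrix, so no unique steady state exists")
    return r / (1.0 - (p - r))


def matmul2(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def transition_power(p, r, n):
    """n-th power of the transition matrix, by repeated multiplication."""
    P = [[p, 1.0 - p], [r, 1.0 - r]]
    result = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(n):
        result = matmul2(result, P)
    return result


p_s = steady_state_on_probability(0.9, 0.3)
limit = transition_power(0.9, 0.3, 200)
```

Both rows of `limit` approach $[p_s, 1 - p_s]$ at a geometric rate set by the second eigenvalue $p - r$.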



Recall, from Section III-B, the definition of the genie-aided
system: In any slot $k$, the feedback $f_k$ contains the channel
state information, corresponding to slot $k$, of not only the
scheduled user but of all the users in the system.
Also, the delay profile from the original system is retained
in the genie-aided system, i.e., the cumulative feedback $f_k$
arrives at the scheduler with delay $D(a_k, k)$ that is i.i.d across
the scheduling choice and the originating slot $k$, with the
probability mass function $P_D(d)$. Thus, thanks to the cumulative
nature of the feedback, the scheduling decision in the current
slot does not affect the information available for scheduling in
future slots. Hence, the greedy policy is optimal in the
genie-aided system. With this insight, we now report our result on
the sum capacity of the original downlink with two users.

Proposition 7: When $N = 2$, the sum capacity of the
Markov-modeled downlink with randomly delayed ARQ
equals that of the genie-aided system. This sum capacity
equals

$$C_{\text{sum}}(2) = \sum_{l=0}^{\infty} \left[ p_s T^l(p) + (1 - p_s) p_s \right] P(D \le l) \prod_{d=0}^{l-1} P(D > d). \qquad (28)$$

Furthermore, the greedy policy achieves this sum capacity.

Proof: We first focus on the sum capacity of the genie-aided
system, i.e., the sum throughput of the greedy policy
in the genie-aided system. Recall, from Section III-A, the
quantity $L$, the measure of freshness of the latest ARQ
feedback. We defined $L$ such that $L = l \Leftrightarrow$ the latest
feedback is $l + 1$ slots old. We extend the meaning of $L$
to the genie-aided system. Due to the first order Markovian
nature of the channels, in the genie-aided system, conditioned
on the latest feedback, $f_{t+l+1}$ (with $t$ denoting the current
slot), the belief values (and hence the greedy scheduling
decision) in the current slot are independent of the feedback
from previous slots, i.e., $f_k$, $k > t + l + 1$. Thus, with
$R^{\text{greedy}}_{\text{genie}}(l, N)$ denoting the conditional (conditioned on $L = l$) immediate
reward corresponding to the greedy policy in the $N$-user
genie-aided system with steady state initial conditions, the sum
capacity of the genie-aided system can be written as

$$C^{\text{genie}}_{\text{sum}}(N) = \sum_{l=0}^{\infty} P(L = l) \, R^{\text{greedy}}_{\text{genie}}(l, N). \qquad (29)$$

We now evaluate $R^{\text{greedy}}_{\text{genie}}(l, N)$. From Lemma 1, the belief
value (in the current slot) of a user with an ON channel $l + 1$
slots earlier, i.e., $T^l(p)$, is higher than the belief value of a
user with an OFF channel $l + 1$ slots earlier, i.e., $T^l(r)$. Thus,
in steady state,

$$\begin{aligned} R^{\text{greedy}}_{\text{genie}}(l, N) &= P(\text{at least one of the } N \text{ users has an ON channel in steady state}) \, T^l(p) \\ &\quad + P(\text{all users have OFF channels in steady state}) \, T^l(r) \\ &= \left(1 - (1 - p_s)^N\right) T^l(p) + (1 - p_s)^N T^l(r). \end{aligned} \qquad (30)$$

*$p + (1 - r) = 2$ leads to $P = I$, a trivial case with no steady state.



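By explicitly including the probability mass function of the
quantity $L$ as a function of the ARQ delay statistics, from (29)
and (30), we have

$$C^{\text{genie}}_{\text{sum}}(N) = \sum_{l=0}^{\infty} \left[ \left(1 - (1 - p_s)^N\right) T^l(p) + (1 - p_s)^N T^l(r) \right] P(D \le l) \prod_{d=0}^{l-1} P(D > d). \qquad (31)$$

When $N = 2$, with minor algebraic manipulations, we have

$$C^{\text{genie}}_{\text{sum}}(2) = \sum_{l=0}^{\infty} \left[ p_s T^l(p) + (1 - p_s) p_s \right] P(D \le l) \prod_{d=0}^{l-1} P(D > d). \qquad (32)$$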
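One way to carry out the algebraic step for $N = 2$ is sketched below. It relies on the stationarity identity $p_s T^l(p) + (1 - p_s) T^l(r) = p_s$, which holds because a chain started in steady state remains in steady state after any number of steps; this identity is our assumption about the intended manipulation, not a step quoted from the text.

```latex
\begin{align*}
\left(1 - (1 - p_s)^2\right) T^l(p) + (1 - p_s)^2 \, T^l(r)
  &= \left(p_s + p_s (1 - p_s)\right) T^l(p) + (1 - p_s)^2 \, T^l(r) \\
  &= p_s T^l(p) + (1 - p_s) \left[ p_s T^l(p) + (1 - p_s) T^l(r) \right] \\
  &= p_s T^l(p) + (1 - p_s) \, p_s .
\end{align*}
```

Substituting the last line into the summand for $N = 2$ gives the bracketed term $p_s T^l(p) + (1 - p_s) p_s$.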



We now proceed to prove that the sum throughput of
the greedy policy in the original system equals that of the
greedy policy in the genie-aided system when $N = 2$. We
established in the course of the proof of Proposition 2 that,
in the original system with $N = 2$, conditioned on $L = l$,
the greedy decision in the current slot $t$ is solely determined
by the ARQ feedback from slot $t + l + 1$ with the following
decision rule: When the user scheduled in slot $t + l + 1$, i.e.,
$a_{t+l+1}$, sends back an ACK, that user is scheduled in the
current slot $t$, i.e., $a_t = a_{t+l+1}$. Otherwise, the other user is
scheduled in slot $t$. We can interpret this decision logic of
the greedy policy as below:

When at least one of the users had an ON channel in slot
$t + l + 1$, that user* is identified for scheduling in the current
slot $t$, leading to an expected current reward of $T^l(p)$. Reward
$T^l(r)$ is accrued only when both the channels were in the
OFF state in slot $t + l + 1$.

Note that the decision rule and the accrued immediate rewards
corresponding to the greedy policy in the original system
are the same as those of the greedy policy in the genie-aided
system. Thus, in the original system, under the greedy policy,
no improvement in the immediate reward can be achieved
even if the channel states of both the users in slot $t + l + 1$
are available at the scheduler in slot $t$. This, along with
the fact that both the systems have the same delay profile,
establishes the equivalence between the original and the
genie-aided systems, when $N = 2$, in terms of the sum throughput
achieved by the greedy policy. We have already proved the
sum throughput optimality of the greedy policy in the original
system when $N = 2$ (Proposition 2) and in the genie-aided
system for a general value of $N$. Thus the sum capacity of
the original system for $N = 2$ is given by $C^{\text{genie}}_{\text{sum}}(2)$ in (32).
The proposition thus follows. ■

*User $a_{t+l+1}$ is given higher priority if both channels were ON.
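To make the closed form for the two-user sum capacity concrete, the sketch below evaluates it for a finite-support delay distribution. It is only an illustration: the function names are hypothetical, the delay pmf is passed as a list $[P(D = 0), \ldots, P(D = d_{\max})]$, and the belief update $T(x) = xp + (1 - x)r$ is assumed. Terms with $l > d_{\max}$ vanish because the product then contains $P(D > d_{\max}) = 0$.

```python
def belief_after(x, p, r, steps):
    """Apply the assumed one-step belief update T(x) = x*p + (1-x)*r `steps` times."""
    for _ in range(steps):
        x = x * p + (1.0 - x) * r
    return x


def sum_capacity_two_users(p, r, delay_pmf):
    """Two-user sum capacity for delay_pmf = [P(D=0), ..., P(D=d_max)]."""
    p_s = r / (1.0 - (p - r))
    total = 0.0
    cdf = 0.0              # P(D <= l)
    not_arrived = 1.0      # prod_{d=0}^{l-1} P(D > d)
    for l, prob in enumerate(delay_pmf):
        cdf += prob
        reward = p_s * belief_after(p, p, r, l) + (1.0 - p_s) * p_s
        total += reward * cdf * not_arrived
        not_arrived *= 1.0 - cdf    # extend the product by P(D > l)
    return total


instantaneous = sum_capacity_two_users(0.9, 0.3, [1.0])
mixed = sum_capacity_two_users(0.9, 0.3, [0.5, 0.5])
delay_two = sum_capacity_two_users(0.9, 0.3, [0.0, 0.0, 1.0])
```

For a deterministic delay $D = d$ only the $l = d$ term survives, so the sum collapses to $p_s T^d(p) + (1 - p_s) p_s$.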
Remarks: Insights on the result in Proposition 7 can
be obtained by examining the fundamental trade-off when
scheduling in the Markov-modeled downlink. In particular,
scheduling must take into account

1) data transmission in the current slot, which influences the
immediate reward, and

2) probing of the channel for future scheduling decisions,
which influences the reward expected in future slots.

The optimal schedule strikes a balance between these two
objectives (that need not contradict each other). From the
discussion in the proof of Proposition 7, we see that, in
the original system, when $N = 2$, the choice of the user
whose channel is probed becomes irrelevant as far as the
optimal future reward is concerned. Similarly, in the
genie-aided system, since the channel state information of all the
users (general $N$-user system) is sent to the scheduler (with equal
delay that is i.i.d across the scheduling choice) irrespective
of which user was scheduled, the optimal future reward is
independent of the current scheduling decision. This results
in the optimality of the greedy policy in the original and the
genie-aided systems and creates a sum capacity equivalence
between these two systems, when $N = 2$.

The equivalence with the genie-aided system vanishes when
$N > 2$, since observing only one user is not enough to capture
an 'ON-user', if one exists. This was possible when $N = 2$.
Thus, when $N > 2$, there is room for throughput improvement
when the channel state information of all the users is available
at the scheduler, even if there is a delay (the genie-aided
system). The genie-aided system sum capacity is thus an upper
bound to the sum capacity of the original system. We record
this next.

Corollary 8: When $N > 2$, the sum capacity, $C_{\text{sum}}(N)$, of
the downlink can be bounded as

$$C_{\text{sum}}(2) \le C_{\text{sum}}(N) \le C^{\text{genie}}_{\text{sum}}(N). \qquad (33)$$

Proof: The lower bound $C_{\text{sum}}(2)$, given in (28), is
achieved by the scheduler when, in each slot, it considers
only two users (a fixed set) for scheduling and ignores the rest,
effectively emulating a two-user downlink. The upper bound
is the sum capacity of the genie-aided system with $N$ users,
as given in (31). ■
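The two bounds in the corollary can be evaluated with one routine, since the lower bound is the genie-aided expression with two users. The sketch below is an illustration under the same assumptions as before (hypothetical function names, finite-support delay pmf given as a list, belief update $T(x) = xp + (1 - x)r$).

```python
def belief_after(x, p, r, steps):
    """Apply the assumed one-step belief update T(x) = x*p + (1-x)*r `steps` times."""
    for _ in range(steps):
        x = x * p + (1.0 - x) * r
    return x


def freshness_pmf(delay_pmf):
    """P(L = l) = P(D <= l) * prod_{d=0}^{l-1} P(D > d) for l = 0, ..., d_max."""
    pmf, cdf, not_arrived = [], 0.0, 1.0
    for prob in delay_pmf:
        cdf += prob
        pmf.append(cdf * not_arrived)
        not_arrived *= 1.0 - cdf
    return pmf


def genie_sum_capacity(n_users, p, r, delay_pmf):
    """Sum capacity of the genie-aided system with n_users users."""
    p_s = r / (1.0 - (p - r))
    all_off = (1.0 - p_s) ** n_users
    return sum(
        weight * ((1.0 - all_off) * belief_after(p, p, r, l)
                  + all_off * belief_after(r, p, r, l))
        for l, weight in enumerate(freshness_pmf(delay_pmf))
    )


lower = genie_sum_capacity(2, 0.9, 0.3, [0.0, 1.0])    # equals C_sum(2)
upper = genie_sum_capacity(5, 0.9, 0.3, [0.0, 1.0])    # genie-aided bound for N = 5
```

The gap between `upper` and `lower` shrinks as $(1 - p_s)^N$ becomes small relative to $(1 - p_s)^2$, i.e., when an ON user is very likely to exist even among two users.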



B. Bounds on the Capacity Region of the Downlink 

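Define the capacity region of the downlink as the exhaustive
set of achievable throughput vectors. Formally, let $\mu_i^{\mathfrak{A}}$ denote
the throughput of user $i$ under policy $\mathfrak{A}$. Let $I_k(i)$ be the
indicator function on whether user $i$ was scheduled in slot
$k$, i.e.,

$$I_k(i) = \begin{cases} 1 & \text{if } i = a_k \\ 0 & \text{otherwise.} \end{cases} \qquad (34)$$

Thus

$$\mu_i^{\mathfrak{A}} = \lim_{m \to \infty} \frac{1}{m} \sum_{k=1}^{m} E_{\pi_k} \left[ I_k(i) \, R_k(\pi_k, a_k) \right] \qquad (35)$$

where $R_k(\pi_k, a_k)$ is the immediate reward accrued by the
scheduler in slot $k$ under policy $\mathfrak{A}$. The expectation is over
the belief vector $\pi_k$ with steady state initial conditions. Now,
the capacity region of the downlink, $\mathcal{C}$, is defined as the union
of the throughput vectors, $(\mu_1^{\mathfrak{A}}, \ldots, \mu_N^{\mathfrak{A}})$, over all scheduling
policies, i.e.,

$$\mathcal{C} = \bigcup_{\mathfrak{A}} \left( \mu_1^{\mathfrak{A}}, \ldots, \mu_N^{\mathfrak{A}} \right). \qquad (36)$$

Let $H_{\text{convex}}(X)$ be the convex hull of the set of points $X$,
defined as

$$H_{\text{convex}}(X) = \left\{ \sum_{i=1}^{n(X)} \beta_i x_i \ \middle| \ x_i \in X, \ \beta_i \in \mathbb{R}, \ \beta_i \ge 0, \ \sum_{i=1}^{n(X)} \beta_i = 1 \right\},$$

where $n(X)$ is the cardinality of set $X$. With these definitions
we now state our results on the downlink capacity region.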

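Proposition 9: An outer bound on the capacity region of
the Markov-modeled downlink with randomly delayed ARQ
is given by the complement of the $N$-dimensional polyhedron
$\mathcal{P}$ represented by

$$\mathcal{P} = \left\{ (x_1 \ge 0, x_2 \ge 0, \ldots, x_N \ge 0) : \sum_{i \in S} x_i \le C^{\text{genie}}_{\text{sum}}(n(S)), \ \forall S \subseteq \{1, \ldots, N\} \right\}, \qquad (37)$$

where

$$C^{\text{genie}}_{\text{sum}}(n) = \sum_{l=0}^{\infty} \left[ \left(1 - (1 - p_s)^n\right) T^l(p) + (1 - p_s)^n T^l(r) \right] P(D \le l) \prod_{d=0}^{l-1} P(D > d).$$

An inner bound on the capacity region is given by the set of
points $(x_1, \ldots, x_N)$ such that

$$(x_1, \ldots, x_N) \in H_{\text{convex}}\left(O, \{X_i\}_{\forall i \in \{1, \ldots, N\}}, \{Y_{j,k}\}_{\forall j,k \in \{1, \ldots, N\}, j \neq k}\right) \qquad (38)$$

where $O, X_i, Y_{j,k} \in \mathbb{R}^N$. $O$ is the origin $(0, \ldots, 0)$. $X_i =
(0, \ldots, 0, p_s, 0, \ldots, 0)$ with $p_s$ at the $i$-th location. $Y_{j,k}$, $j \neq k$, is
$(0, \ldots, 0, \frac{C_{\text{sum}}(2)}{2}, 0, \ldots, 0, \frac{C_{\text{sum}}(2)}{2}, 0, \ldots, 0)$ with $\frac{C_{\text{sum}}(2)}{2}$ at
locations $j$ and $k$, where

$$C_{\text{sum}}(2) = \sum_{l=0}^{\infty} \left[ p_s T^l(p) + (1 - p_s) p_s \right] P(D \le l) \prod_{d=0}^{l-1} P(D > d).$$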

Proof: Considering the genie-aided system, for any
policy $\mathfrak{A}$, let the throughput vector be denoted by
$(\mu_1^{\mathfrak{A},\text{genie}}, \ldots, \mu_N^{\mathfrak{A},\text{genie}})$. For a subset of users $S \subseteq \{1, \ldots, N\}$,
by the definition of sum capacity, we have

$$\sum_{i \in S} \mu_i^{\mathfrak{A},\text{genie}} \le C^{\text{genie}}_{\text{sum}}(n(S)). \qquad (39)$$

This establishes the complement of the polyhedron $\mathcal{P}$ as an
outer bound on the capacity region of the genie-aided system,
and by extension, an outer bound on the capacity region of
the original system.

Now, consider the inner bound
$H_{\text{convex}}(O, \{X_i\}_{\forall i \in \{1, \ldots, N\}}, \{Y_{j,k}\}_{\forall j,k \in \{1, \ldots, N\}, j \neq k})$.
In the original system, throughput vector $X_i =
(0, \ldots, 0, p_s, 0, \ldots, 0)$ can be achieved by scheduling
user $i$ at all times. Recall that the greedy policy achieves
the sum capacity when $N = 2$. Also, the sum throughput
$C_{\text{sum}}(2)$ is split equally between the two users thanks to the
inherent symmetry between users. Thus throughput vector
$Y_{j,k}$, $j \neq k$, can
be achieved by greedy scheduling over the users $j$ and $k$ alone
at all slots. Throughput vector $O$ corresponds to idling in
every slot. Therefore, any throughput vector in the convex hull
$H_{\text{convex}}(O, \{X_i\}_{\forall i \in \{1, \ldots, N\}}, \{Y_{j,k}\}_{\forall j,k \in \{1, \ldots, N\}, j \neq k})$ can be
achieved by time sharing between the policies that achieve
throughput vectors in $\{O, X_i, Y_{j,k}, j \neq k\}$. This establishes the
result on the inner bound. ■

Fig. 5 illustrates the capacity region bounds from Proposition 9
when $N = 2$ and when $N = 3$.

For the special case of $N = 2$ users and deterministic ARQ
feedback delay, $D = d$, we obtain the exact capacity region
of the genie-aided system and hence tighter bounds on the
capacity region of the original system.

Proposition 10: For $N = 2$ users, with a deterministic
ARQ delay of $D = d$, $d \ge 0$ slots, the capacity region of
the genie-aided system is given by the set of points $(x_1, x_2)$
such that

$$(x_1, x_2) \in H_{\text{convex}}(O, X_1, Z_1, Z_2, X_2)$$

where

$$\begin{aligned} O &= (0, 0) \\ X_1 &= (p_s, 0) \\ X_2 &= (0, p_s) \\ Z_1 &= \left( p_s T^d(p) + (1 - p_s)^2 T^d(r), \ (1 - p_s) p_s T^d(p) \right) \\ Z_2 &= \left( (1 - p_s) p_s T^d(p), \ p_s T^d(p) + (1 - p_s)^2 T^d(r) \right). \end{aligned} \qquad (40)$$

Proof: The relative positions of the points
$X_1$, $X_2$, $Z_1$, $Z_2$ and $O$ are illustrated in Fig. 6.

We proceed by first showing that the region complementary
to $H_{\text{convex}}(O, X_1, Z_1, Z_2, X_2)$ is an outer bound on the capacity
region of the genie-aided downlink. Consider a broad class
of schedulers in the genie-aided system, with each member
identified by the parameters $\alpha_i \in [0, 1]$, $i \in \{1, \ldots, 4\}$. A
member of this class obeys the following decision logic at
slot $t$:
Fig. 5. Illustration of bounds on the capacity region of the downlink with
randomly delayed ARQ when $N = 2$ and when $N = 3$.



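• If $(S_{t+d+1}(1), S_{t+d+1}(2)) = (0, 0)$, then schedule user 1 with
probability $\alpha_1$ and user 2 w.p. $1 - \alpha_1$.

• If $(S_{t+d+1}(1), S_{t+d+1}(2)) = (0, 1)$, then $a_t = 1$ w.p. $\alpha_2$ and
$a_t = 2$ w.p. $1 - \alpha_2$.

• If $(S_{t+d+1}(1), S_{t+d+1}(2)) = (1, 0)$, then $a_t = 1$ w.p. $\alpha_3$ and
$a_t = 2$ w.p. $1 - \alpha_3$.

• If $(S_{t+d+1}(1), S_{t+d+1}(2)) = (1, 1)$, then $a_t = 1$ w.p. $\alpha_4$ and
$a_t = 2$ w.p. $1 - \alpha_4$.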



Note that, thanks to the first order Markovian nature of the
underlying channels, any scheduling policy in the genie-aided
system falls under the above class of schedulers or will
have a member of this class achieving the same throughput
vector as itself. We now proceed to show that the throughput
vector achieved by any member of this class belongs to
$H_{\text{convex}}(O, X_1, Z_1, Z_2, X_2)$.

Fig. 6. Illustration of the capacity region of the genie-aided system and
tighter bounds on the capacity region of the original system when $N = 2$,
with deterministic ARQ delay.

With $\alpha = \{\alpha_1, \ldots, \alpha_4\} \in [0, 1]^4$ fixed, the throughput for
user 1 is given by

$$\begin{aligned} \mu_1^{\alpha,\text{genie}} &= \sum_{(i,j) \in \{0,1\}^2} P\left(S_{t+d+1}(1) = i, S_{t+d+1}(2) = j\right) P\left(a_t = 1 \mid S_{t+d+1}(1) = i, S_{t+d+1}(2) = j\right) P\left(S_t(1) = 1 \mid S_{t+d+1}(1) = i\right) \\ &= (1 - p_s)^2 \alpha_1 T^d(r) + (1 - p_s) p_s \alpha_2 T^d(r) + p_s (1 - p_s) \alpha_3 T^d(p) + p_s^2 \alpha_4 T^d(p). \end{aligned} \qquad (41)$$

Similarly,

$$\mu_2^{\alpha,\text{genie}} = (1 - p_s)^2 (1 - \alpha_1) T^d(r) + (1 - p_s) p_s (1 - \alpha_2) T^d(p) + p_s (1 - p_s)(1 - \alpha_3) T^d(r) + p_s^2 (1 - \alpha_4) T^d(p). \qquad (42)$$

For notational simplicity, we will henceforth denote the
throughputs simply by $\mu_1$ and $\mu_2$. The sum throughput is now
given by

$$\mu_1 + \mu_2 = p_s + (1 - p_s) p_s \left( T^d(p) - T^d(r) \right)(\alpha_3 - \alpha_2). \qquad (43)$$

Note that the values of $\alpha_1$ and $\alpha_4$ are irrelevant from the
sum throughput point of view. Consider the following two
cases.

Case 1, when $\alpha_3 \le \alpha_2$:

$$0 \le \mu_1 + \mu_2 \le p_s.$$

Since $X_1(1) + X_1(2) = X_2(1) + X_2(2) = p_s$, we have

$$(\mu_1, \mu_2) \in H_{\text{convex}}(O, X_1, X_2). \qquad (44)$$

Case 2, when $\alpha_3 > \alpha_2$:

$$p_s < \mu_1 + \mu_2 \le p_s + (1 - p_s) p_s \left( T^d(p) - T^d(r) \right) = p_s T^d(p) + (1 - p_s) p_s.$$
Since $Z_1(1) + Z_1(2) = Z_2(1) + Z_2(2) = p_s T^d(p) + (1 - p_s)^2 T^d(r) + (1 - p_s) p_s T^d(p) = p_s T^d(p) + (1 - p_s) p_s$, we
can find points $E_{X_1 Z_1}$ and $E_{X_2 Z_2}$ on edges $X_1 Z_1$ and $X_2 Z_2$,
respectively, such that $E_{X_1 Z_1}(1) + E_{X_1 Z_1}(2) = E_{X_2 Z_2}(1) +
E_{X_2 Z_2}(2) = \mu_1 + \mu_2$. Any point $P_{X_1 Z_1}$ on the edge $X_1 Z_1$
can be written as a convex combination of points $X_1$ and $Z_1$,
i.e., $\exists \, \beta \in [0, 1]$ such that

$$P_{X_1 Z_1} = \left( p_s \beta + \left( p_s T^d(p) + (1 - p_s)^2 T^d(r) \right)(1 - \beta), \ (1 - p_s) p_s T^d(p)(1 - \beta) \right).$$

With $\beta = 1 - (\alpha_3 - \alpha_2)$, we have $P_{X_1 Z_1}(1) + P_{X_1 Z_1}(2) =
\mu_1 + \mu_2$. Thus

$$E_{X_1 Z_1} = \left( p_s \left(1 - (\alpha_3 - \alpha_2)\right) + \left( p_s T^d(p) + (1 - p_s)^2 T^d(r) \right)(\alpha_3 - \alpha_2), \ (1 - p_s) p_s T^d(p)(\alpha_3 - \alpha_2) \right).$$

Due to the symmetry between $X_1, Z_1$ and $X_2, Z_2$, we have
$E_{X_2 Z_2} = (E_{X_1 Z_1}(2), E_{X_1 Z_1}(1))$. Using $\mu_1$ from (41), it can
be shown that, for any $\alpha_{i \in \{1, \ldots, 4\}} \in [0, 1]$ with $\alpha_3 > \alpha_2$,

$$E_{X_2 Z_2}(1) \le \mu_1 \le E_{X_1 Z_1}(1). \qquad (45)$$

Since $E_{X_1 Z_1}(1) + E_{X_1 Z_1}(2) = E_{X_2 Z_2}(1) + E_{X_2 Z_2}(2) = \mu_1 +
\mu_2$, (45) translates to

$$(\mu_1, \mu_2) \in H_{\text{convex}}(E_{X_1 Z_1}, E_{X_2 Z_2}).$$

The above relation, along with the fact that $E_{X_1 Z_1} \in
H_{\text{convex}}(X_1, Z_1)$ and $E_{X_2 Z_2} \in H_{\text{convex}}(X_2, Z_2)$, yields

$$(\mu_1, \mu_2) \in H_{\text{convex}}(X_1, Z_1, Z_2, X_2). \qquad (46)$$

Combining the results in (44) and (46), we establish that the region complementary to $H_{\text{convex}}(O, X_1, Z_1, Z_2, X_2)$ is an outer bound on the capacity region of the genie-aided system.

Revisiting the class of schedulers identified by $\alpha$, it can be shown from (41) and (42) that a scheduler with $\alpha = \{1, 0, 1, 1\}$ achieves a throughput vector $(\mu_1, \mu_2) = Z_1 = \big(p_s T^d(p) + (1-p_s)^2 T^d(r),\ (1-p_s) p_s T^d(p)\big)$. Similarly, a scheduler with $\alpha = \{0, 0, 1, 0\}$ achieves a throughput vector $(\mu_1, \mu_2) = Z_2 = \big((1-p_s) p_s T^d(p),\ p_s T^d(p) + (1-p_s)^2 T^d(r)\big)$. Throughput vectors $X_1$ or $X_2$ can be achieved by scheduling only user 1 or only user 2, respectively, at all times. Thus any throughput vector within the region $H_{\text{convex}}(O, X_1, Z_1, Z_2, X_2)$ can be supported by time sharing between the schedulers that achieve throughput vectors in $\{O, X_1, Z_1, Z_2, X_2\}$. This establishes $H_{\text{convex}}(O, X_1, Z_1, Z_2, X_2)$ as an inner bound on the capacity region of the genie-aided system.

Combining the outer and inner bound results establishes the 
proposition. ■ 
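As a numerical sanity check on the argument above, the following minimal Python sketch evaluates the throughput pair of (41) and (42) for a given $\alpha$ and compares it against (43) and the corner point $Z_1$. The function names and the sample values of $p$, $r$ and $d$ are illustrative assumptions, not taken from the paper, and $p_s$ is taken to be the steady-state ON probability $r/(1-(p-r))$.

```python
def belief_step(x, p, r, steps):
    """Apply the one-step belief update T(x) = x*p + (1 - x)*r `steps` times."""
    for _ in range(steps):
        x = x * p + (1.0 - x) * r
    return x


def genie_throughputs(alpha, p, r, d):
    """Return (mu1, mu2, ps, T^d(p), T^d(r)) following equations (41)-(42).

    alpha[i] is the probability of scheduling user 1 when the genie reports
    the states (S(1), S(2)) = (0,0), (0,1), (1,0), (1,1), in that order.
    """
    ps = r / (1.0 - (p - r))            # steady-state ON probability (assumed definition)
    tp = belief_step(p, p, r, d)        # T^d(p)
    tr = belief_step(r, p, r, d)        # T^d(r)
    a1, a2, a3, a4 = alpha
    w = [(1 - ps) ** 2, (1 - ps) * ps, ps * (1 - ps), ps ** 2]  # state-pair probabilities
    mu1 = w[0] * a1 * tr + w[1] * a2 * tr + w[2] * a3 * tp + w[3] * a4 * tp
    mu2 = (w[0] * (1 - a1) * tr + w[1] * (1 - a2) * tp
           + w[2] * (1 - a3) * tr + w[3] * (1 - a4) * tp)
    return mu1, mu2, ps, tp, tr


if __name__ == "__main__":
    # Hypothetical parameters chosen only for illustration.
    mu1, mu2, ps, tp, tr = genie_throughputs((0.3, 0.2, 0.9, 0.5), p=0.8, r=0.3, d=2)
    print(mu1 + mu2, ps + (1 - ps) * ps * (tp - tr) * (0.9 - 0.2))
```

The two printed numbers should agree, which reflects the fact that only $\alpha_3 - \alpha_2$ enters the sum throughput.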

We now report tighter bounds on the capacity region of 
the original system, when N = 2 and the ARQ delay is 
deterministic. 

Corollary 11: For $N = 2$ users, with a deterministic ARQ delay of $D = d$, $d \ge 0$ slots, an outer bound on the capacity region of the original system is given by the set of points $(x_1, x_2)$ such that

$$(x_1, x_2) \notin H_{\text{convex}}(O, X_1, Z_1, Z_2, X_2)$$

where
$$O = (0, 0)$$
$$X_1 = (p_s, 0)$$
$$X_2 = (0, p_s)$$
$$Z_1 = \big(p_s T^d(p) + (1-p_s)^2 T^d(r),\ (1-p_s) p_s T^d(p)\big)$$
$$Z_2 = \big((1-p_s) p_s T^d(p),\ p_s T^d(p) + (1-p_s)^2 T^d(r)\big) \quad (47)$$

and an inner bound is given by the set of points $(x_1, x_2)$ such that

$$(x_1, x_2) \in H_{\text{convex}}(O, X_1, Y_{1,2}, X_2)$$

where $Y_{1,2} = \left( \frac{C_{\text{sum}}(2)}{2}, \frac{C_{\text{sum}}(2)}{2} \right)$ with $C_{\text{sum}}(2) = p_s T^d(p) + (1-p_s) p_s$, the sum capacity of the system.

Proof: The outer bound is the region complementary to the capacity region of the genie-aided system reported in Proposition 10. The inner bound was obtained in Proposition 9, with $C_{\text{sum}}(2)$ from (28) re-derived using $P(D = d) = 1$. ■

Fig. 6 illustrates the improved outer bound from Corollary 11 along with the bounds derived in Proposition 9.

V. Conclusion 

We addressed the problem of opportunistic multiuser 
scheduling for a system consisting of a base station or access 
point transmitting to users within its domain. We model the 
downlink channels by two-state Markov chains, with ON and 
OFF states, and assume that the data destined for each user is 
infinitely backlogged. We allow for the ARQ feedback from 
each user to the base station to be randomly (i.i.d. over all 
users) delayed. For the case of two users in the system, we 
showed that the greedy policy is sum throughput optimal for 
any distribution of the ARQ feedback delay. However, for
more than two users, there exist scenarios for which the
greedy policy is not optimal. Nevertheless, extensive numerical
experiments suggest that the greedy policy has near-optimal
performance. Encouraged by this, we studied the structure of
the greedy policy and showed that it can be implemented
by a simple algorithm that does not require the statistics of
the underlying Markov channel or the ARQ feedback delay,
thus making it robust against errors in estimation of these
statistics. Focusing on the fundamental limits of the downlink
system, we obtained an elegant closed form expression for the 
sum capacity of the two-user downlink and derived inner and 
outer bounds on the capacity region of the Markov-modeled 
downlink with randomly delayed ARQ feedback. 

In summary, we addressed opportunistic multiuser scheduling based on existing ARQ feedback mechanisms, while taking into account an important non-ideality in the feedback channel: the random delay. We studied this scheduling problem by examining various aspects of the 'easy to implement' greedy policy and by establishing fundamental limits on the downlink system performance. We believe that the work we have initiated here, along with the proof techniques we have developed, could be a first step towards studying the joint channel learning and scheduling problem under more general scenarios, such as when the users have heterogeneous demands or when the queues are non-backlogged with random packet arrivals.

Appendix A
Proof of Lemma 1

Recall the definition of the $u$-step belief evolution operator: $T^u(x) = T(T^{(u-1)}(x)) = T^{(u-1)}(T(x))$ with $T(x) = xp + (1-x)r = x(p-r) + r$ and $T^0(x) = x$ for $x \in [0,1]$ and $u \in \{0, 1, 2, \ldots\}$. For $u \in \{1, 2, \ldots\}$, $x \in [0,1]$,

$$T^u(p) = T^{(u-1)}(p)\,p + \big(1 - T^{(u-1)}(p)\big) r$$
$$T^{(u+1)}(x) = T^u(x)\,p + \big(1 - T^u(x)\big) r$$
$$\Rightarrow\ T^u(p) - T^{(u+1)}(x) = (p-r)\big(T^{(u-1)}(p) - T^u(x)\big). \quad (48)$$

Thus if, for $u \in \{1, 2, \ldots\}$, $T^{(u-1)}(p) - T^u(x) \ge 0$, then, since $p > r$, we have $T^u(p) - T^{(u+1)}(x) \ge 0$. By induction, using $p \ge T(x) = xp + (1-x)r$ for any $x \in [0,1]$, we have $T^u(p) \ge T^{u+1}(x)$ for any $u \in \{0, 1, 2, \ldots\}$ and $x \in [0,1]$. The second inequality in the lemma can be proved along the same lines using $r \le T(x) = xp + (1-x)r$.

Consider the third inequality. By definition, for any $x, y \in [0,1]$, $T^u(x) - T^u(y) = (p-r)\big(T^{(u-1)}(x) - T^{(u-1)}(y)\big)$. Thus, if $T^{(u-1)}(x) - T^{(u-1)}(y) \ge 0$, then $T^u(x) - T^u(y) \ge 0$. When $x \ge y$, by induction, $T^u(x) - T^u(y) \ge 0$ for any $u \in \{0, 1, 2, \ldots\}$. This establishes the third inequality.

Considering the last inequality, the belief evolution operator can be expressed as

$$T^u(x) = T(T^{(u-1)}(x)) = T(T(T^{(u-2)}(x)))$$
$$= x(p-r)^u + r\,\frac{1 - (p-r)^u}{1 - (p-r)}$$
$$= \frac{r}{1-(p-r)} + (p-r)^u \left(x - \frac{r}{1-(p-r)}\right) \quad (49)$$

for $u \in \{0, 1, 2, \ldots\}$ and $x \in [0,1]$. Thus $p_s = \lim_{u \to \infty} T^u(x) = \frac{r}{1-(p-r)}$. Note that, since $p > r$, $T^u(p) \ge \frac{r}{1-(p-r)} = p_s$. Also, $T^u(r) \le \frac{r}{1-(p-r)} = p_s$. This establishes the last inequality in the lemma.
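The closed form (49) and the orderings used in this proof are easy to check numerically. The sketch below is a minimal illustration, assuming a positively correlated channel ($p > r$) with $1 - (p - r) \neq 0$; the function names and the sample parameters are hypothetical and chosen only for the example.

```python
def T(x, p, r):
    """One-step belief evolution: T(x) = x*p + (1 - x)*r."""
    return x * p + (1.0 - x) * r


def T_iter(x, p, r, u):
    """u-step belief evolution computed by repeated application of T."""
    for _ in range(u):
        x = T(x, p, r)
    return x


def T_closed(x, p, r, u):
    """Closed form of equation (49): ps + (p - r)^u * (x - ps)."""
    ps = r / (1.0 - (p - r))
    return ps + (p - r) ** u * (x - ps)


def lemma_holds(p, r, max_u=8, grid=21, eps=1e-12):
    """Check the inequalities of the proof on a grid of beliefs (requires p > r)."""
    ps = r / (1.0 - (p - r))
    xs = [i / (grid - 1) for i in range(grid)]
    for u in range(max_u):
        tp, tr = T_iter(p, p, r, u), T_iter(r, p, r, u)
        if not (tp >= ps - eps and tr <= ps + eps):
            return False
        for x in xs:
            nxt = T_iter(x, p, r, u + 1)
            if not (tp >= nxt - eps and tr <= nxt + eps):
                return False
            if abs(T_iter(x, p, r, u) - T_closed(x, p, r, u)) > 1e-9:
                return False
    return True


if __name__ == "__main__":
    print(lemma_holds(p=0.8, r=0.3))
```

The check iterates over a finite grid and a finite number of steps, so it illustrates the lemma rather than proving it.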



Appendix B
Proof of Corollary 3

The proof closely follows that of Proposition 2. Recall the quantities $f_k, F_k, \Gamma_k, k_1, k_2, l_1, l_2$ from the proof of Proposition 2. Consider a slot $t < m$ with the sequence of past actions given by $\mathbf{a}_{t+1} = \{a_m, \ldots, a_{t+1}\}$. The proof proceeds in two steps. In step 1, we show that the greedy decision in slot $t$, given the ARQ feedback and the scheduling decision from slot $\min(k_1, k_2)$, is independent of the feedback and scheduling decision corresponding to slot $\max(k_1, k_2)$. In step 2, we show that, if the greedy policy is implemented in slot $t$, then the expected immediate reward in slot $t$ is independent of the scheduling decisions $\mathbf{a}_{t+1}$. We then provide induction-based arguments to establish the proposition.



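Step 1: Let $F_{t+1} := \{F_m, F_{m-1}, \ldots, F_{t+1}\}$ and $\Gamma_{t+1} := \{\Gamma_m, \Gamma_{m-1}, \ldots, \Gamma_{t+1}\}$. The greedy decision in slot $t$, conditioned on the past feedback and scheduling decisions, is given by

$$\hat{a}_t \mid F_{t+1}, \Gamma_{t+1}, \mathbf{a}_{t+1}, \pi_m \;=\; \hat{a}_t \mid f_{k_1}, f_{k_2}, l_1, l_2, \mathbf{a}_{t+1}, \pi_m. \quad (50)$$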



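The preceding equation comes directly from the first-order Markovian property of the underlying channels. Consider the case when $k_1 < k_2 \le m$ ($\Rightarrow l_1 < l_2$) or $k_1 = k_2 = \emptyset$ ($\Rightarrow l_1 = l_2 = \emptyset$). The belief values in slot $t$ as a function of feedback $f_{k_1}$ and $f_{k_2}$ are given below:

$$\pi_t = \begin{cases}
\big(T^{l_1}(p),\, T^{l_2}(p)\big), & \text{if } f_{k_1} = 1,\ f_{k_2} = 1 \\
\big(T^{l_1}(p),\, T^{l_2}(r)\big), & \text{if } f_{k_1} = 1,\ f_{k_2} = 0 \\
\big(T^{l_1}(p),\, T^{(m-t)}(\pi_m(2))\big), & \text{if } f_{k_1} = 1,\ k_2 = \emptyset \\
\big(T^{l_1}(r),\, T^{l_2}(p)\big), & \text{if } f_{k_1} = 0,\ f_{k_2} = 1 \\
\big(T^{l_1}(r),\, T^{l_2}(r)\big), & \text{if } f_{k_1} = 0,\ f_{k_2} = 0 \\
\big(T^{l_1}(r),\, T^{(m-t)}(\pi_m(2))\big), & \text{if } f_{k_1} = 0,\ k_2 = \emptyset \\
\big(T^{(m-t)}(\pi_m(1)),\, T^{(m-t)}(\pi_m(2))\big), & \text{if } k_1 = \emptyset,\ k_2 = \emptyset.
\end{cases} \quad (51)$$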



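Now, from the definition of $T^u(\cdot)$ and using the fact that $p < r$, the following inequalities can be readily verified. For $u \in \{1, 3, 5, \ldots\}$, $w > u$ and $x \in [0,1]$,

$$T^u(p) \ge T^w(x)$$
$$T^u(r) \le T^w(x). \quad (52)$$

For $u \in \{0, 2, 4, \ldots\}$, $w > u$ and $x \in [0,1]$,

$$T^u(p) \le T^w(x). \quad (53)$$

The preceding inequalities essentially result from the oscillating nature of the evolution of belief values in a negatively correlated Markov channel. Using these inequalities and (51), the greedy decision in slot $t$ can be written as

$$\hat{a}_t \mid f_{k_1}, f_{k_2}, l_1, l_2, \mathbf{a}_{t+1}, \pi_m = \begin{cases}
\begin{cases} 1, & \text{if } f_{k_1} = 1 \\ 2, & \text{if } f_{k_1} = 0 \end{cases} & \text{if } k_1 \neq \emptyset,\ l_1 \text{ is odd} \\[2ex]
\begin{cases} 2, & \text{if } f_{k_1} = 1 \\ 1, & \text{if } f_{k_1} = 0 \end{cases} & \text{if } k_1 \neq \emptyset,\ l_1 \text{ is even} \\[2ex]
\begin{cases} \arg\max_{i \in \{1,2\}} \pi_m(i), & \text{if } m - t \text{ is even} \\ \arg\min_{i \in \{1,2\}} \pi_m(i), & \text{if } m - t \text{ is odd} \end{cases} & \text{if } k_1, k_2 = \emptyset.
\end{cases} \quad (54)$$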
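The oscillation behind (52) and (53) can be observed numerically. The following Python sketch is only an illustration under the stated assumption of a negatively correlated channel ($p < r$); the helper names, the grid resolution and the sample parameters are hypothetical.

```python
def T_iter(x, p, r, u):
    """u-step belief evolution with T(x) = x*p + (1 - x)*r."""
    for _ in range(u):
        x = x * p + (1.0 - x) * r
    return x


def oscillation_inequalities_hold(p, r, max_u=6, extra=3, grid=11, eps=1e-12):
    """Check (52) for odd u and (53) for even u, for all w > u on a belief grid."""
    if not p < r:
        raise ValueError("this check assumes a negatively correlated channel, p < r")
    xs = [i / (grid - 1) for i in range(grid)]
    for u in range(max_u):
        tp, tr = T_iter(p, p, r, u), T_iter(r, p, r, u)
        for w in range(u + 1, max_u + extra):
            for x in xs:
                tw = T_iter(x, p, r, w)
                if u % 2 == 1:
                    # (52): T^u(p) >= T^w(x) and T^u(r) <= T^w(x)
                    if tp < tw - eps or tr > tw + eps:
                        return False
                else:
                    # (53): T^u(p) <= T^w(x)
                    if tp > tw + eps:
                        return False
    return True


if __name__ == "__main__":
    print(oscillation_inequalities_hold(p=0.2, r=0.8))
```

Because the sign of $(p - r)^u$ alternates with $u$, an ACK that is an odd number of steps old signals a good channel while one that is an even number of steps old signals a bad one, which is what the case split in (54) encodes.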



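Thus the greedy decision is independent of the feedback $f_{k_2}$ if $k_1 < k_2$. We now proceed to generalize equation (54). Let $k^*$ denote the latest slot for which an ARQ feedback is available from one of the users by slot $t$. Let $l = k^* - t - 1$ for $k^* \neq \emptyset$ and $l = \emptyset$ for $k^* = \emptyset$ be a measure of freshness of the latest ARQ feedback. Thus, using the preceding discussion, we have

$$\hat{a}_t = \begin{cases}
\begin{cases} a_{k^*}, & \text{if } f_{k^*} = 1 \\ \bar{a}_{k^*}, & \text{if } f_{k^*} = 0 \end{cases} & \text{if } k^* \neq \emptyset,\ k^* - t \text{ is even} \\[2ex]
\begin{cases} \bar{a}_{k^*}, & \text{if } f_{k^*} = 1 \\ a_{k^*}, & \text{if } f_{k^*} = 0 \end{cases} & \text{if } k^* \neq \emptyset,\ k^* - t \text{ is odd} \\[2ex]
\begin{cases} \arg\max_{i \in \{1,2\}} \pi_m(i), & \text{if } m - t \text{ is even} \\ \arg\min_{i \in \{1,2\}} \pi_m(i), & \text{if } m - t \text{ is odd} \end{cases} & \text{if } k^* = \emptyset
\end{cases} \quad (55)$$

where $\bar{a}_{k^*}$ is the user not scheduled in slot $k^*$. This completes step 1 of the proof.

Step 2: If the greedy policy is implemented in slot $t$, the immediate reward expected in slot $t$, conditioned on scheduling decisions $\mathbf{a}_{t+1}$ and initial belief $\pi_m$, can be rewritten as

$$E_{\pi_t \mid \mathbf{a}_{t+1}, \pi_m}\big(R_t(\pi_t, \hat{a}_t)\big) = E_{\pi_t \mid l = \emptyset, \mathbf{a}_{t+1}, \pi_m}\big(R_t(\pi_t, \hat{a}_t)\big)\, P(l = \emptyset \mid \mathbf{a}_{t+1}, \pi_m) + E_{l, l \neq \emptyset \mid \mathbf{a}_{t+1}, \pi_m}\, E_{\pi_t \mid l, l \neq \emptyset, \mathbf{a}_{t+1}, \pi_m}\big(R_t(\pi_t, \hat{a}_t)\big). \quad (56)$$

Note that

$$E_{\pi_t \mid l = \emptyset, \mathbf{a}_{t+1}, \pi_m}\big(R_t(\pi_t, \hat{a}_t)\big) = \max_i T^{(m-t)}(\pi_m(i)) \quad (57)$$

since, with $l = \emptyset$, i.e., no past feedback at the scheduler, the belief values at slot $t$ are independent of the past scheduling decisions and are simply given by $\pi_t = T^{(m-t)}(\pi_m)$. Now, rewriting the second part of (56),

$$E_{l, l \neq \emptyset \mid \mathbf{a}_{t+1}, \pi_m}\, E_{\pi_t \mid l, l \neq \emptyset, \mathbf{a}_{t+1}, \pi_m}\big(R_t(\pi_t, \hat{a}_t)\big) = E_{l, l \neq \emptyset \mid \mathbf{a}_{t+1}, \pi_m}\, E_{\pi_{l+t+1} \mid l, l \neq \emptyset, \mathbf{a}_{t+1}, \pi_m}\, E_{\pi_t \mid \pi_{l+t+1}, l, l \neq \emptyset, \mathbf{a}_{t+1}, \pi_m}\big(R_t(\pi_t, \hat{a}_t)\big). \quad (58)$$

Consider $E_{\pi_t \mid \pi_{l+t+1}, l, l \neq \emptyset, \mathbf{a}_{t+1}, \pi_m}(R_t(\pi_t, \hat{a}_t))$. From the first step of the proof, the greedy decision in slot $t$ can be made solely based on the latest feedback, i.e., $f_{k^* = l+t+1}$. This was recorded in (55).

Thus, when $l$ is an odd number (equivalently, $k^* - t$ is even), if the feedback $f_{k^*}$ is an ACK (which occurs with probability $\pi_{l+t+1}(a_{l+t+1})$), reschedule the user $a_{l+t+1}$ in slot $t$. Conditioned on $f_{k^*} = 1$, the belief value $\pi_t(a_{l+t+1})$, and hence the expected immediate reward in slot $t$, is given by $T^l(p)$. If the feedback is a NACK, schedule the other user, denoted by $\bar{a}_{l+t+1}$. Conditioned on $f_{k^*} = 0$, the belief value $\pi_t(\bar{a}_{l+t+1})$, and hence the expected immediate reward in slot $t$, is given by $T^{(l+1)}(\pi_{l+t+1}(\bar{a}_{l+t+1})) = \pi_{l+t+1}(\bar{a}_{l+t+1}) T^l(p) + (1 - \pi_{l+t+1}(\bar{a}_{l+t+1})) T^l(r)$. Averaging over $f_{k^* = l+t+1}$, when $l$ is odd,

$$E_{\pi_t \mid \pi_{l+t+1}, l, l \neq \emptyset, \mathbf{a}_{t+1}, \pi_m}\big(R_t(\pi_t, \hat{a}_t)\big) = \pi_{l+t+1}(a_{l+t+1}) T^l(p) + \big(1 - \pi_{l+t+1}(a_{l+t+1})\big) \times \Big(\pi_{l+t+1}(\bar{a}_{l+t+1}) T^l(p) + \big(1 - \pi_{l+t+1}(\bar{a}_{l+t+1})\big) T^l(r)\Big)$$
$$= P\big(\{S_{l+t+1}(1) = 1 \cup S_{l+t+1}(2) = 1\} \mid \pi_{l+t+1}\big) T^l(p) + P\big(\{S_{l+t+1}(1) = 0 \cap S_{l+t+1}(2) = 0\} \mid \pi_{l+t+1}\big) T^l(r) \quad (59)$$

where $S_k(i)$ is the 1/0 state of the channel of user $i$ in slot $k$. Using similar arguments, when $l$ is an even number,

$$E_{\pi_t \mid \pi_{l+t+1}, l, l \neq \emptyset, \mathbf{a}_{t+1}, \pi_m}\big(R_t(\pi_t, \hat{a}_t)\big) = P\big(\{S_{l+t+1}(1) = 0 \cup S_{l+t+1}(2) = 0\} \mid \pi_{l+t+1}\big) T^l(r) + P\big(\{S_{l+t+1}(1) = 1 \cap S_{l+t+1}(2) = 1\} \mid \pi_{l+t+1}\big) T^l(p). \quad (60)$$

Now, from (58), using arguments similar to those used at the corresponding step in the proof of Proposition 2, we have

$$E_{l, l \neq \emptyset \mid \mathbf{a}_{t+1}, \pi_m}\, E_{\pi_t \mid l, l \neq \emptyset, \mathbf{a}_{t+1}, \pi_m}\big(R_t(\pi_t, \hat{a}_t)\big) = E_{l, l \neq \emptyset \mid \mathbf{a}_{t+1}, \pi_m} \begin{cases}
P\big(\{S_{l+t+1}(1) = 1 \cup S_{l+t+1}(2) = 1\} \mid \pi_m\big) T^l(p) + P\big(\{S_{l+t+1}(1) = 0 \cap S_{l+t+1}(2) = 0\} \mid \pi_m\big) T^l(r), & \text{if } l \text{ is odd} \\
P\big(\{S_{l+t+1}(1) = 0 \cup S_{l+t+1}(2) = 0\} \mid \pi_m\big) T^l(r) + P\big(\{S_{l+t+1}(1) = 1 \cap S_{l+t+1}(2) = 1\} \mid \pi_m\big) T^l(p), & \text{if } l \text{ is even.}
\end{cases} \quad (61)$$

Now, along the lines of the proof of Proposition 2, we average the expected reward over $l$ and obtain

$$E_{\pi_t \mid \mathbf{a}_{t+1}, \pi_m}\big(R_t(\pi_t, \hat{a}_t)\big) = \prod_{k=t+1}^{m} P\big(D(a_k, k) > k - t - 1\big)\, \max_i T^{(m-t)}(\pi_m(i)) + \sum_{l=0}^{m-t-1} P\big(D(1, l+t+1) \le l\big) \prod_{k=t+1}^{l+t} P\big(D(1, k) > k - t - 1\big)\, E_{\pi_t \mid l, l \neq \emptyset, \mathbf{a}_{t+1}, \pi_m}\big(R_t(\pi_t, \hat{a}_t)\big) \quad (62)$$

where the last quantity is given, for each $l$, by the corresponding case in (61). Thus the expected reward in slot $t$ is independent of the sequence of actions $\{a_m, a_{m-1}, \ldots, a_{t+1}\}$ if the greedy policy is implemented in slot $t$. By extension, the total reward expected from slot $t$ until the horizon is independent of the scheduling vector $\mathbf{a}_{t+1}$ if the greedy policy is implemented in slots $\{t, t-1, \ldots, 1\}$, i.e.,

$$\sum_{k=1}^{t} E_{\pi_k \mid \mathbf{a}_{t+1}, \pi_m} R_k(\pi_k, \hat{a}_k) = \sum_{k=1}^{t} E_{\pi_k \mid \pi_m} R_k(\pi_k, \hat{a}_k). \quad (63)$$

Thus, if the greedy policy is optimal in slots $\{t, t-1, \ldots, 1\}$, then it is also optimal in slot $t+1$. Since $t$ is arbitrary and since the greedy policy is optimal at the horizon, by induction, the greedy policy is optimal in every slot $\{m, m-1, \ldots, 1\}$. This establishes the proposition.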



Appendix C
Proof of Proposition 5

The proof proceeds in two steps: (1) we first construct a counterexample to the optimality of the greedy policy when the horizon is $m = 4$, for an arbitrary number of users $N$, $N > 2$; (2) based on this counterexample, we then construct a more general counterexample, with arbitrary $m$, $m > 3$, and $N$, $N > 2$. We proceed with the first step next.

Assume an arbitrary number of users $N$, $N > 2$. Let the horizon be $m = 4$. Assume a deterministic ARQ delay of one time slot, i.e., $P_D(d = 1) = 1$ and $P_D(d \neq 1) = 0$. Let the users be indexed in decreasing order of their initial beliefs, i.e., $\pi_m(1) \ge \pi_m(2) \ge \ldots \ge \pi_m(N)$. The net expected reward corresponding to the greedy policy is given by

$$V_4(\pi_4, \{\hat{a}_k\}_{k=1}^{4}) = \pi_4(1) + T(\pi_4(1)) + E_{f_4 \mid \pi_4, a_4 = 1}[R_2] + E_{f_3, f_4 \mid \pi_4, a_4 = 1, a_3 = 1}[R_1]. \quad (64)$$

Note that since the delay is one slot, the first ARQ feedback comes at the end of slot 3. Thus, the greedy decision in both slots 4 and 3 is user 1. Also, the greedy scheduler has access to feedback $f_4$ only at the beginning of slot 2, and to both feedback $f_4$ and $f_3$ at the beginning of slot 1. Therefore, $R_2$ is averaged over $f_4$ and $R_1$ is averaged over $f_4$ and $f_3$. The average total reward under the greedy policy can thus be evaluated by averaging over all realizations of $f_4$ and $f_3$. Table V lists the belief values of the three users in slots 2 and 1 for various values of $\{f_4, f_3\}$ along with the greedy decisions and immediate rewards in slots 2 and 1. Note from the table that the belief value $\pi_2$ at slot 2 is a function of $f_4$ only, while $\pi_1$ at slot 1 is a function of both $f_4$ and $f_3$, consistent with the preceding discussion.

The probabilities of occurrence of the various realizations of $\{f_4, f_3\}$ are summarized below.

$$P(f_4, f_3) = \begin{cases}
\pi_4(1)\,p, & \text{if } \{f_4, f_3\} = \{1, 1\} \\
\pi_4(1)(1-p), & \text{if } \{f_4, f_3\} = \{1, 0\} \\
(1 - \pi_4(1))\,r, & \text{if } \{f_4, f_3\} = \{0, 1\} \\
(1 - \pi_4(1))(1-r), & \text{if } \{f_4, f_3\} = \{0, 0\}.
\end{cases} \quad (65)$$

Thus the net expected reward under the greedy policy is given by

$$V_4(\pi_4, \{\hat{a}_k\}_{k=1}^{4}) = \pi_4(1) + T(\pi_4(1)) + \pi_4(1)\,p\,\big(2T(p)\big) + \pi_4(1)(1-p)\big(T(p) + T^3(\pi_4(2))\big) + (1 - \pi_4(1))\,r\,\big(T^2(\pi_4(2)) + T(p)\big) + (1 - \pi_4(1))(1-r)\big(T^2(\pi_4(2)) + T^3(\pi_4(2))\big). \quad (66)$$

Now, with $a_k^*$ indicating the optimal decision in slot $k$, consider the following policy $\tilde{\mathfrak{A}}$ such that $\tilde{a}_4 = 1$, $\tilde{a}_3 = 2$, $\tilde{a}_2 = a_2^*$, $\tilde{a}_1 = a_1^*$. Since the ARQ delay is deterministic and equals one slot, the decision in slot 2 does not affect the reward in slot 1. Thus the greedy policy is optimal in slot 2. Trivially, the greedy policy is optimal in slot 1 as well. Thus $a_2^* = \hat{a}_2$, $a_1^* = \hat{a}_1$. The average total reward under $\tilde{\mathfrak{A}}$ is given by

$$V_4(\pi_4, \{\tilde{a}_k\}_{k=1}^{4}) = \pi_4(1) + T(\pi_4(2)) + E_{f_4 \mid \pi_4, a_4 = 1}[R_2] + E_{f_3, f_4 \mid \pi_4, a_4 = 1, a_3 = 2}[R_1]. \quad (67)$$

We evaluate $V_4(\pi_4, \{\tilde{a}_k\}_{k=1}^{4})$ along the lines of the greedy net expected reward evaluation. Table VI summarizes the beliefs, scheduling decisions $\tilde{a}_k$ and immediate rewards in slots 2 and 1 for all the realizations of $\{f_4, f_3\}$ when $\{a_4, a_3\} = \{1, 2\}$. Users are once again ordered according to their initial belief values, i.e., $\pi_4(1) \ge \pi_4(2) \ge \pi_4(3)$. Note from the table that the belief value $\pi_2$ at slot 2 is a function of $f_4$ only, while $\pi_1$ at slot 1 is a function of both $f_4$ and $f_3$, consistent with the ARQ delay profile.

The probabilities of occurrence of the various realizations of $\{f_4, f_3\}$ when $a_4 = 1$, $a_3 = 2$, are summarized below.

$$P(f_4, f_3) = \begin{cases}
\pi_4(1)\,T(\pi_4(2)), & \text{if } \{f_4, f_3\} = \{1, 1\} \\
\pi_4(1)\big(1 - T(\pi_4(2))\big), & \text{if } \{f_4, f_3\} = \{1, 0\} \\
(1 - \pi_4(1))\,T(\pi_4(2)), & \text{if } \{f_4, f_3\} = \{0, 1\} \\
(1 - \pi_4(1))\big(1 - T(\pi_4(2))\big), & \text{if } \{f_4, f_3\} = \{0, 0\}.
\end{cases} \quad (68)$$

Thus, the net expected reward under policy $\tilde{\mathfrak{A}}$ is given by

$$V_4(\pi_4, \{\tilde{a}_k\}_{k=1}^{4}) = \pi_4(1) + T(\pi_4(2)) + \pi_4(1)\,T(\pi_4(2))\big(2T(p)\big) + \pi_4(1)\big(1 - T(\pi_4(2))\big)\big(T(p) + T^2(p)\big) + (1 - \pi_4(1))\,T(\pi_4(2))\big(T^2(\pi_4(2)) + T(p)\big) + (1 - \pi_4(1))\big(1 - T(\pi_4(2))\big)\big(T^2(\pi_4(2)) + T^3(\pi_4(3))\big). \quad (69)$$

We now proceed to show that, for $N = 3$, deterministic ARQ delay $D = 1$ and horizon $m = 4$, $\exists\ p, r, \pi_4$ such that the net expected reward corresponding to policy $\tilde{\mathfrak{A}}$ is strictly higher
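than that of the greedy policy. The difference in reward, after algebraic manipulations, is given by

$$V_4(\pi_4, \{\tilde{a}_k\}_{k=1}^{4}) - V_4(\pi_4, \{\hat{a}_k\}_{k=1}^{4}) = (p-r)\Big(\pi_4(2) - \pi_4(1) + (p-r)^2 (1 - \pi_4(1))\, \pi_4(3) \big(1 - r - (p-r)\pi_4(2)\big)\Big). \quad (70)$$

For the special case $\pi_4(1) = \pi_4(2) = p$, we have

$$V_4(\pi_4, \{\tilde{a}_k\}_{k=1}^{4}) - V_4(\pi_4, \{\hat{a}_k\}_{k=1}^{4}) = (p-r)^3 (1-p)\, \pi_4(3) \big(1 - r - (p-r)p\big). \quad (71)$$

For any $p < 1$, since $p > r$, $V_4(\pi_4, \{\tilde{a}_k\}_{k=1}^{4}) > V_4(\pi_4, \{\hat{a}_k\}_{k=1}^{4})\ \forall\ \pi_4(3) > 0$. With the net expected reward of the optimal policy being no less than $V_4(\pi_4, \{\tilde{a}_k\}_{k=1}^{4})$, we see that the greedy policy is not in general optimal. Table VII lists a few other values of $p$, $r$, $\pi_4$ for which the greedy policy is suboptimal. This establishes a counterexample for the optimality of the greedy policy when $N > 2$.

TABLE V
Belief values, scheduling decisions, immediate rewards in slots 2 and 1 for various realizations of ARQ feedback under the greedy policy.

$\{f_4, f_3\}$ | $\pi_2$ at slot 2 | $\hat{a}_2$ | $R_2$ | $\pi_1$ at slot 1 | $\hat{a}_1$ | $R_1$
$\{1, 1\}$ | $(T(p),\ T^2(\pi_4(2)),\ T^2(\pi_4(3)))$ | 1 | $T(p)$ | $(T(p),\ T^3(\pi_4(2)),\ T^3(\pi_4(3)))$ | 1 | $T(p)$
$\{1, 0\}$ | $(T(p),\ T^2(\pi_4(2)),\ T^2(\pi_4(3)))$ | 1 | $T(p)$ | $(T(r),\ T^3(\pi_4(2)),\ T^3(\pi_4(3)))$ | 2 | $T^3(\pi_4(2))$
$\{0, 1\}$ | $(T(r),\ T^2(\pi_4(2)),\ T^2(\pi_4(3)))$ | 2 | $T^2(\pi_4(2))$ | $(T(p),\ T^3(\pi_4(2)),\ T^3(\pi_4(3)))$ | 1 | $T(p)$
$\{0, 0\}$ | $(T(r),\ T^2(\pi_4(2)),\ T^2(\pi_4(3)))$ | 2 | $T^2(\pi_4(2))$ | $(T(r),\ T^3(\pi_4(2)),\ T^3(\pi_4(3)))$ | 2 | $T^3(\pi_4(2))$

TABLE VI
Belief values, scheduling decisions, immediate rewards in slots 2 and 1 for various realizations of ARQ feedback under policy $\tilde{\mathfrak{A}}$.

$\{f_4, f_3\}$ | $\pi_2$ at slot 2 | $\tilde{a}_2$ | $R_2$ | $\pi_1$ at slot 1 | $\tilde{a}_1$ | $R_1$
$\{1, 1\}$ | $(T(p),\ T^2(\pi_4(2)),\ T^2(\pi_4(3)))$ | 1 | $T(p)$ | $(T^2(p),\ T(p),\ T^3(\pi_4(3)))$ | 2 | $T(p)$
$\{1, 0\}$ | $(T(p),\ T^2(\pi_4(2)),\ T^2(\pi_4(3)))$ | 1 | $T(p)$ | $(T^2(p),\ T(r),\ T^3(\pi_4(3)))$ | 1 | $T^2(p)$
$\{0, 1\}$ | $(T(r),\ T^2(\pi_4(2)),\ T^2(\pi_4(3)))$ | 2 | $T^2(\pi_4(2))$ | $(T^2(r),\ T(p),\ T^3(\pi_4(3)))$ | 2 | $T(p)$
$\{0, 0\}$ | $(T(r),\ T^2(\pi_4(2)),\ T^2(\pi_4(3)))$ | 2 | $T^2(\pi_4(2))$ | $(T^2(r),\ T(r),\ T^3(\pi_4(3)))$ | 3 | $T^3(\pi_4(3))$

TABLE VII
Sample system parameters when the greedy policy is suboptimal. Number of users $N = 3$, deterministic delay $D = 1$, horizon $m = 4$ is used.

$p$ | $r$ | $\pi_4$ | $V_4(\pi_4, \{\tilde{a}_k\}_{k=1}^{4})$ | $V_4(\pi_4, \{\hat{a}_k\}_{k=1}^{4})$ | $V_4(\pi_4, \{\tilde{a}_k\}_{k=1}^{4}) - V_4(\pi_4, \{\hat{a}_k\}_{k=1}^{4})$
0.9308 | 0.1797 | (0.5216, 0.5130, 0.3305) | 2.6368 | 2.6141 | 0.0227
0.8875 | 0.0186 | (0.3416, 0.3310, 0.2648) | 1.6155 | 1.5454 | 0.0701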
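The four-slot counterexample can be reproduced numerically. The sketch below is a minimal illustration that evaluates the greedy reward (66), the reward of the alternative policy (69), and the gap (70) for the first parameter set of Table VII; the function names are hypothetical and the code assumes $N = 3$, $D = 1$ and $m = 4$ as in the text.

```python
def T(x, p, r, n=1):
    """n-step belief evolution with T(x) = x*p + (1 - x)*r."""
    for _ in range(n):
        x = x * p + (1.0 - x) * r
    return x


def reward_greedy(pi4, p, r):
    """Net expected reward of the greedy policy, equation (66)."""
    a, b = pi4[0], pi4[1]
    tp = T(p, p, r)
    return (a + T(a, p, r)
            + a * p * (2 * tp)
            + a * (1 - p) * (tp + T(b, p, r, 3))
            + (1 - a) * r * (T(b, p, r, 2) + tp)
            + (1 - a) * (1 - r) * (T(b, p, r, 2) + T(b, p, r, 3)))


def reward_alternative(pi4, p, r):
    """Net expected reward when user 1 is scheduled in slot 4 and user 2 in slot 3, equation (69)."""
    a, b, c = pi4
    tp, tb = T(p, p, r), T(b, p, r)
    return (a + tb
            + a * tb * (2 * tp)
            + a * (1 - tb) * (tp + T(p, p, r, 2))
            + (1 - a) * tb * (T(b, p, r, 2) + tp)
            + (1 - a) * (1 - tb) * (T(b, p, r, 2) + T(c, p, r, 3)))


def reward_gap(pi4, p, r):
    """Closed-form reward difference, equation (70)."""
    a, b, c = pi4
    return (p - r) * ((b - a) + (p - r) ** 2 * (1 - a) * c * (1 - r - (p - r) * b))


if __name__ == "__main__":
    pi4, p, r = (0.5216, 0.5130, 0.3305), 0.9308, 0.1797  # first row of Table VII
    print(round(reward_alternative(pi4, p, r), 4),
          round(reward_greedy(pi4, p, r), 4),
          round(reward_gap(pi4, p, r), 4))
```

A positive gap means the alternative policy earns strictly more than the greedy policy for that parameter set.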

A more general counterexample for an arbitrary horizon length $m$ can be constructed based on the one thus established. We proceed with this construction in the sequel. As before, assume $\pi_m(1) \ge \pi_m(2) \ge \ldots \ge \pi_m(N)$ and a deterministic ARQ delay of one time slot, i.e., $P_D(d = 1) = 1$. With $S_k(i)$ indicating the underlying state of the channel of user $i$ in slot $k$, consider the following realization of the channel: $\mathcal{R}_1 = \{S_m(1) = 1, S_{m-1}(1) = 1, \ldots, S_5(1) = 1;\ S_m(2) = 1, S_{m-1}(2) = 1, \ldots, S_5(2) = 1\}$. Recall the policy $\tilde{\mathfrak{A}}$ from above. We define a variant of this policy, $\mathfrak{B}$, as follows. Policy $\mathfrak{B}$ performs greedy scheduling in slots $\{m, \ldots, 5\}$. Under realization $\mathcal{R}_1$, policy $\mathfrak{B}$, being a greedy scheduler, schedules user 1 in slots $\{m, \ldots, 5\}$. Thus the realization $\{S_m(1) = 1, S_{m-1}(1) = 1, \ldots, S_6(1) = 1\}$ is observable by policy $\mathfrak{B}$ by the beginning of slot 4. From slot 4, under realization $\mathcal{R}_1$, define policy $\mathfrak{B}$ such that it behaves along the lines of policy $\tilde{\mathfrak{A}}$ defined earlier. Also, policy $\mathfrak{B}$ performs greedy scheduling in all slots $\{m, \ldots, 1\}$ under all channel realizations other than $\mathcal{R}_1$. Thus the reward difference between policy $\mathfrak{B}$ and the greedy scheduler $\hat{\mathfrak{A}}$ is given by the difference in slots $\{4, \ldots, 1\}$, under realization $\mathcal{R}_1$, weighted by the probability of this realization. Formally,

$$V_m(\pi_m, \{\mathfrak{B}_k\}_{k=1}^{m}) - V_m(\pi_m, \{\hat{\mathfrak{A}}_k\}_{k=1}^{m}) = \text{Prob}\{\mathcal{R}_1\} \times \Big(V_4(\pi_4, \{\tilde{a}_k\}_{k=1}^{4}) - V_4(\pi_4, \{\hat{a}_k\}_{k=1}^{4})\Big) \quad (72)$$

with $\pi_4 = \{p, p, T^{m-4}(\pi_m(3)), \ldots, T^{m-4}(\pi_m(N))\}$. Note that the belief values $\pi_4$ reflect the realization $\mathcal{R}_1$ and the greedy nature of both policies until slot 5. Now, since policy $\mathfrak{B}$ is defined to behave like policy $\tilde{\mathfrak{A}}$ from slot 4, we can use the reward difference expression in (70) to simplify (72) as below.

$$V_m(\pi_m, \{\mathfrak{B}_k\}_{k=1}^{m}) - V_m(\pi_m, \{\hat{\mathfrak{A}}_k\}_{k=1}^{m}) = \text{Prob}\{\mathcal{R}_1\} \Big( (p-r)\big(\pi_4(2) - \pi_4(1) + (p-r)^2 (1 - \pi_4(1))\, \pi_4(3) \times (1 - r - (p-r)\pi_4(2))\big) \Big)\Big|_{\pi_4(1) = \pi_4(2) = p}$$
$$= \text{Prob}\{\mathcal{R}_1\} \Big( (p-r)^3 (1-p) \big((1-r) - (p-r)p\big)\, \pi_4(3) \Big). \quad (73)$$

Note that the last expression is always positive, rendering the greedy policy $\hat{\mathfrak{A}}$ suboptimal. This establishes a more general counterexample to the optimality of the greedy policy when $N > 2$. The proposition is thus proved.

Appendix D
Proof of Proposition 7

Consider the case when $N = 2$ and the ARQ feedback is instantaneous (end of slot), i.e., $D = 0$. Let $p_i, r_i$ indicate the Markov channel probabilities for user $i \in \{1, 2\}$. Let $p_1 = p_2 > r_1 > r_2$. Assume horizon length $m = 2$. Let $\pi_2(1) > \pi_2(2)$. The total reward under the greedy decision in the current slot $k = 2$ and the optimal reward at the horizon, i.e., $k = 1$, is given by

$$\hat{V}_2 = \pi_2(1) + \pi_2(1) \max\big(p_1, T(\pi_2(2))\big) + (1 - \pi_2(1)) \max\big(r_1, T(\pi_2(2))\big)$$
$$= \pi_2(1) + \pi_2(1)\,p_1 + (1 - \pi_2(1)) \max\big(r_1, T(\pi_2(2))\big) \quad (74)$$

where we have used the fact that the greedy policy is optimal in the last slot and $p_1 \ge T(\pi_2(1)) \ge T(\pi_2(2))$.

The total reward when the non-greedy decision is made in the current slot and the optimal decision is made in the last slot is given by

$$\tilde{V}_2 = \pi_2(2) + \pi_2(2) \max\big(p_2, T(\pi_2(1))\big) + (1 - \pi_2(2)) \max\big(r_2, T(\pi_2(1))\big)$$
$$= \pi_2(2) + \pi_2(2)\,p_2 + (1 - \pi_2(2)) \max\big(r_2, T(\pi_2(1))\big) \quad (75)$$

where we have used $p_2 = p_1 \ge T(\pi_2(1))$ and $r_2 \le T(\pi_2(2)) \le T(\pi_2(1))$. Now considering the special case when $T(\pi_2(2)) \ge r_1$, we have, with algebraic manipulations,

$$\hat{V}_2 - \tilde{V}_2 = \big(\pi_2(1) - \pi_2(2)\big) - (r_1 - r_2)(1 - \pi_2(1))(1 - \pi_2(2)). \quad (76)$$

In Table VIII we provide numerical examples consistent with the setup assumed above, yielding negative values for $\hat{V}_2 - \tilde{V}_2$. This establishes that the greedy policy is not, in general, optimal when the Markov channels are non-identical, even when the number of users is $N = 2$ and the ARQ feedback is instantaneous. The proposition is thus proved.

TABLE VIII
Sample system parameters when the greedy policy is suboptimal under non-identical Markov channels. Number of users $N = 2$, instantaneous ARQ feedback $D = 0$, horizon $m = 2$ are assumed.

$p_1 = p_2$ | $r_1$ | $r_2$ | $\pi_2(1)$ | $\pi_2(2)$ | $T(\pi_2(1))$ | $T(\pi_2(2))$ | $\hat{V}_2 - \tilde{V}_2$
0.5060 | 0.1411 | 0.1054 | 0.2276 | 0.2179 | 0.2241 | 0.1926 | -0.0119
0.6333 | 0.3952 | 0.1296 | 0.5864 | 0.5861 | 0.5348 | 0.4248 | -0.0452
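The two-slot comparison in (74) to (76) can be checked directly against the first row of Table VIII. The following Python sketch is illustrative only; the function names are hypothetical, and each user's belief is assumed to evolve with that user's own parameters, i.e., $T(\pi_2(i)) = \pi_2(i)\,p_i + (1 - \pi_2(i))\,r_i$.

```python
def T(x, p, r):
    """One-step belief evolution for a channel with parameters (p, r)."""
    return x * p + (1.0 - x) * r


def two_slot_rewards(pi1, pi2, p, r1, r2):
    """Return (greedy, non_greedy) total rewards of equations (74) and (75).

    Both users share p1 = p2 = p; user i has OFF-to-ON probability r_i.
    Feedback is instantaneous, so the last slot is scheduled optimally.
    """
    t1, t2 = T(pi1, p, r1), T(pi2, p, r2)
    greedy = pi1 + pi1 * max(p, t2) + (1 - pi1) * max(r1, t2)
    non_greedy = pi2 + pi2 * max(p, t1) + (1 - pi2) * max(r2, t1)
    return greedy, non_greedy


def closed_form_gap(pi1, pi2, r1, r2):
    """Equation (76), valid in the special case T(pi2) >= r1."""
    return (pi1 - pi2) - (r1 - r2) * (1 - pi1) * (1 - pi2)


if __name__ == "__main__":
    # First row of Table VIII.
    p, r1, r2, pi1, pi2 = 0.5060, 0.1411, 0.1054, 0.2276, 0.2179
    greedy, non_greedy = two_slot_rewards(pi1, pi2, p, r1, r2)
    print(round(greedy - non_greedy, 4), round(closed_form_gap(pi1, pi2, r1, r2), 4))
```

A negative difference indicates that scheduling the user with the lower initial belief first yields a larger two-slot reward for these parameters.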



Appendix E
Key Quantities

$N$ : Number of users in the downlink
$p$ : P(channel is ON in the current slot | channel was ON in the previous slot)
$r$ : P(channel is ON in the current slot | channel was OFF in the previous slot)
$m$ : Horizon
$\pi_t(i)$ : Belief value of user $i$ in slot $t$
$T^u(\cdot)$ : $u$-step belief evolution operator
$a_k$ : Index of the user scheduled in slot $k$
$\mathfrak{A}_k$ : Scheduling policy applied in slot $k$
$\hat{\mathfrak{A}}_k$ : Greedy scheduling policy applied in slot $k$
$f_k$ : Feedback originating from slot $k$
$F_t$ : Feedback arriving at slot $t$
$D(i,t)$ : Delay of feedback from user $i$ in slot $t$
$P_D(\cdot)$ : Probability mass function of i.i.d. delay $D$
$V_t$ : Net expected reward in slot $t$
$C_{\text{sum}}$ : Sum capacity of the downlink
$C_{\text{sum}}^{\text{genie}}$ : Sum capacity of the genie-aided downlink
$\mu_i$ : Throughput of user $i$ under scheduling policy $\mathfrak{A}$